I. Preprocessing Pipeline¶

  1. Load Historical Market Data
In [ ]:
import yfinance as yf
import pandas as pd
import numpy as np

ticker = "^GSPC"
df = yf.download(ticker, start="2015-01-01", end="2025-01-01")
df.head()
/tmp/ipython-input-4-3101164333.py:6: FutureWarning: YF.download() has changed argument auto_adjust default to True
  df = yf.download(ticker, start="2015-01-01", end="2025-01-01")
[*********************100%***********************]  1 of 1 completed
Out[ ]:
Price Close High Low Open Volume
Ticker ^GSPC ^GSPC ^GSPC ^GSPC ^GSPC
Date
2015-01-02 2058.199951 2072.360107 2046.040039 2058.899902 2708700000
2015-01-05 2020.579956 2054.439941 2017.339966 2054.439941 3799120000
2015-01-06 2002.609985 2030.250000 1992.439941 2022.150024 4460110000
2015-01-07 2025.900024 2029.609985 2005.550049 2005.550049 3805480000
2015-01-08 2062.139893 2064.080078 2030.609985 2030.609985 3934010000
  • Source: Yahoo Finance via yfinance API

  • Ticker Used: ^GSPC (S&P 500 Index), representing broad market trends

  • Date Range: January 1, 2015 – December 31, 2024

  • Columns: Open, High, Low, Close, Volume

Save the Original Dataset

In [ ]:
raw_df = yf.download("^GSPC", start="2015-01-01", end="2025-01-01")
raw_df.to_csv("project_dataset_sp500_raw.csv")
/tmp/ipython-input-5-137008132.py:1: FutureWarning: YF.download() has changed argument auto_adjust default to True
  raw_df = yf.download("^GSPC", start="2015-01-01", end="2025-01-01")

[*********************100%***********************]  1 of 1 completed
  2. Compute Log Returns
  • Log returns are closer to stationary than raw prices and are standard in financial modeling.

  • Formula: r_t = log(Close_t / Close_{t-1})

In [ ]:
# Flatten the MultiIndex columns returned by yfinance
df.columns = df.columns.get_level_values(0)
df["LogReturn"] = np.log(df["Close"] / df["Close"].shift(1))
df.head(8)
Out[ ]:
Price Close High Low Open Volume LogReturn
Date
2015-01-02 2058.199951 2072.360107 2046.040039 2058.899902 2708700000 NaN
2015-01-05 2020.579956 2054.439941 2017.339966 2054.439941 3799120000 -0.018447
2015-01-06 2002.609985 2030.250000 1992.439941 2022.150024 4460110000 -0.008933
2015-01-07 2025.900024 2029.609985 2005.550049 2005.550049 3805480000 0.011563
2015-01-08 2062.139893 2064.080078 2030.609985 2030.609985 3934010000 0.017730
2015-01-09 2044.810059 2064.429932 2038.329956 2063.449951 3364140000 -0.008439
2015-01-12 2028.260010 2049.300049 2022.579956 2046.130005 3456460000 -0.008127
2015-01-13 2023.030029 2056.929932 2008.250000 2031.579956 4107300000 -0.002582
  3. Compute Technical Indicators

These features provide the model with signals on trend, momentum, and potential turning points.

a. Relative Strength Index (RSI) — Momentum Indicator

  • Measures recent gain/loss to detect overbought/oversold conditions.
In [ ]:
def compute_rsi(series, window=14):
    # Simple-moving-average RSI (Cutler's variant); Wilder's original
    # formulation uses exponential smoothing instead
    delta = series.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = -delta.clip(upper=0).rolling(window).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))

df["RSI"] = compute_rsi(df["Close"])
df.head(20)
Out[ ]:
Price Close High Low Open Volume LogReturn RSI
Date
2015-01-02 2058.199951 2072.360107 2046.040039 2058.899902 2708700000 NaN NaN
2015-01-05 2020.579956 2054.439941 2017.339966 2054.439941 3799120000 -0.018447 NaN
2015-01-06 2002.609985 2030.250000 1992.439941 2022.150024 4460110000 -0.008933 NaN
2015-01-07 2025.900024 2029.609985 2005.550049 2005.550049 3805480000 0.011563 NaN
2015-01-08 2062.139893 2064.080078 2030.609985 2030.609985 3934010000 0.017730 NaN
2015-01-09 2044.810059 2064.429932 2038.329956 2063.449951 3364140000 -0.008439 NaN
2015-01-12 2028.260010 2049.300049 2022.579956 2046.130005 3456460000 -0.008127 NaN
2015-01-13 2023.030029 2056.929932 2008.250000 2031.579956 4107300000 -0.002582 NaN
2015-01-14 2011.270020 2018.400024 1988.439941 2018.400024 4378680000 -0.005830 NaN
2015-01-15 1992.670044 2021.349976 1991.469971 2013.750000 4276720000 -0.009291 NaN
2015-01-16 2019.420044 2020.459961 1988.119995 1992.250000 4056410000 0.013335 NaN
2015-01-20 2022.550049 2028.939941 2004.489990 2020.760010 3944340000 0.001549 NaN
2015-01-21 2032.119995 2038.290039 2012.040039 2020.189941 3730070000 0.004720 NaN
2015-01-22 2063.149902 2064.620117 2026.380005 2034.300049 4176050000 0.015154 NaN
2015-01-23 2051.820068 2062.979980 2050.540039 2062.979980 3573560000 -0.005507 48.802572
2015-01-26 2057.090088 2057.620117 2040.969971 2050.419922 3465760000 0.002565 57.799662
2015-01-27 2029.550049 2047.859985 2019.910034 2047.859985 3329810000 -0.013478 55.529127
2015-01-28 2002.160034 2042.489990 2001.489990 2032.339966 4067530000 -0.013588 45.208292
2015-01-29 2021.250000 2024.640015 1989.180054 2002.449951 4127140000 0.009490 41.132852
2015-01-30 1994.989990 2023.319946 1993.380005 2019.349976 4568650000 -0.013077 39.599140

b. MACD (Moving Average Convergence Divergence) — Trend Change Indicator

  • Difference between fast and slow EMAs of the closing price
In [ ]:
def compute_macd(series, fast=12, slow=26, signal=9):
    ema_fast = series.ewm(span=fast, adjust=False).mean()
    ema_slow = series.ewm(span=slow, adjust=False).mean()
    macd = ema_fast - ema_slow
    signal_line = macd.ewm(span=signal, adjust=False).mean()
    return macd, signal_line

df["MACD"], df["MACD_Signal"] = compute_macd(df["Close"])
df.head(20)
Out[ ]:
Price Close High Low Open Volume LogReturn RSI MACD MACD_Signal
Date
2015-01-02 2058.199951 2072.360107 2046.040039 2058.899902 2708700000 NaN NaN 0.000000 0.000000
2015-01-05 2020.579956 2054.439941 2017.339966 2054.439941 3799120000 -0.018447 NaN -3.001025 -0.600205
2015-01-06 2002.609985 2030.250000 1992.439941 2022.150024 4460110000 -0.008933 NaN -6.751558 -1.830476
2015-01-07 2025.900024 2029.609985 2005.550049 2005.550049 3805480000 0.011563 NaN -7.755174 -3.015415
2015-01-08 2062.139893 2064.080078 2030.609985 2030.609985 3934010000 0.017730 NaN -5.562175 -3.524767
2015-01-09 2044.810059 2064.429932 2038.329956 2063.449951 3364140000 -0.008439 NaN -5.163064 -3.852427
2015-01-12 2028.260010 2049.300049 2022.579956 2046.130005 3456460000 -0.008127 NaN -6.111763 -4.304294
2015-01-13 2023.030029 2056.929932 2008.250000 2031.579956 4107300000 -0.002582 NaN -7.202603 -4.883956
2015-01-14 2011.270020 2018.400024 1988.439941 2018.400024 4378680000 -0.005830 NaN -8.913289 -5.689822
2015-01-15 1992.670044 2021.349976 1991.469971 2013.750000 4276720000 -0.009291 NaN -11.635753 -6.879009
2015-01-16 2019.420044 2020.459961 1988.119995 1992.250000 4056410000 0.013335 NaN -11.502233 -7.803654
2015-01-20 2022.550049 2028.939941 2004.489990 2020.760010 3944340000 0.001549 NaN -11.016857 -8.446294
2015-01-21 2032.119995 2038.290039 2012.040039 2020.189941 3730070000 0.004720 NaN -9.747614 -8.706558
2015-01-22 2063.149902 2064.620117 2026.380005 2034.300049 4176050000 0.015154 NaN -6.166789 -8.198604
2015-01-23 2051.820068 2062.979980 2050.540039 2062.979980 3573560000 -0.005507 48.802572 -4.194826 -7.397849
2015-01-26 2057.090088 2057.620117 2040.969971 2050.419922 3465760000 0.002565 57.799662 -2.181637 -6.354606
2015-01-27 2029.550049 2047.859985 2019.910034 2047.859985 3329810000 -0.013478 55.529127 -2.776416 -5.638968
2015-01-28 2002.160034 2042.489990 2001.489990 2032.339966 4067530000 -0.013588 45.208292 -5.395729 -5.590320
2015-01-29 2021.250000 2024.640015 1989.180054 2002.449951 4127140000 0.009490 41.132852 -5.863561 -5.644969
2015-01-30 1994.989990 2023.319946 1993.380005 2019.349976 4568650000 -0.013077 39.599140 -8.258091 -6.167593
  4. Handle Missing Values
  • Drop rows with NaN values due to rolling indicators or first-day return computation.
In [ ]:
df.dropna(inplace=True)
df.head()
Out[ ]:
Price Close High Low Open Volume LogReturn RSI MACD MACD_Signal
Date
2015-01-23 2051.820068 2062.979980 2050.540039 2062.979980 3573560000 -0.005507 48.802572 -4.194826 -7.397849
2015-01-26 2057.090088 2057.620117 2040.969971 2050.419922 3465760000 0.002565 57.799662 -2.181637 -6.354606
2015-01-27 2029.550049 2047.859985 2019.910034 2047.859985 3329810000 -0.013478 55.529127 -2.776416 -5.638968
2015-01-28 2002.160034 2042.489990 2001.489990 2032.339966 4067530000 -0.013588 45.208292 -5.395729 -5.590320
2015-01-29 2021.250000 2024.640015 1989.180054 2002.449951 4127140000 0.009490 41.132852 -5.863561 -5.644969
  5. Save the Processed Dataset
In [ ]:
df.to_csv("project_dataset_sp500_processed.csv", index=True)

II. Feature Normalization¶

  • Apply standardization to features like RSI, MACD, Volume, etc.

  • This step improves neural network convergence.

In [ ]:
from sklearn.preprocessing import StandardScaler

# Copy the original DataFrame
df_scaled = df.copy()

# Select the columns to normalize
feature_cols = ["LogReturn", "RSI", "MACD", "MACD_Signal", "Volume"]

# Fit and transform
scaler = StandardScaler()
df_scaled[feature_cols] = scaler.fit_transform(df_scaled[feature_cols])
df_scaled[["LogReturn", "RSI", "MACD", "MACD_Signal", "Volume"]].head()
Out[ ]:
Price LogReturn RSI MACD MACD_Signal Volume
Date
2015-01-23 -0.525811 -0.487923 -0.414165 -0.537220 -0.451163
2015-01-26 0.190475 0.072745 -0.359326 -0.506722 -0.563553
2015-01-27 -1.233205 -0.068747 -0.375528 -0.485801 -0.705291
2015-01-28 -1.242897 -0.711906 -0.446877 -0.484379 0.063839
2015-01-29 0.804935 -0.965874 -0.459621 -0.485977 0.125987

Save the Cleaned & Scaled Dataset

In [ ]:
df_scaled.to_csv("project_dataset_sp500_processed_cleaned_scaled.csv", index=True)

III. Exploratory Data Analysis & Visualization¶

  1. Distribution of Target Variable
In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 4))
sns.histplot(df_scaled["LogReturn"], bins=50, kde=True, color='skyblue')
plt.title("Distribution of Log Returns")
plt.xlabel("Log Return")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()
[Figure: Distribution of Log Returns]

This histogram shows the distribution of daily log returns. The distribution is sharply peaked around 0 and exhibits fat tails, reflecting the presence of extreme market movements (e.g., crashes or rallies). Such behavior aligns with known stylized facts in financial econometrics and motivates the need for uncertainty-aware forecasting.
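The fat tails can also be quantified numerically via sample skewness and excess kurtosis. A minimal sketch using a simulated heavy-tailed series as a stand-in (in the notebook, `df_scaled["LogReturn"]` would be used instead):

```python
import numpy as np
from scipy import stats

# Heavy-tailed stand-in for daily log returns; in the notebook this
# would be df_scaled["LogReturn"] rather than a simulated series
returns = stats.t.rvs(5, size=2500, random_state=0) * 0.01

# Fisher (excess) kurtosis is 0 for a normal distribution;
# fat-tailed series come out clearly positive
excess_kurt = stats.kurtosis(returns, fisher=True)
skew = stats.skew(returns)

print(f"excess kurtosis: {excess_kurt:.2f}")
print(f"skewness:        {skew:.2f}")
```

A markedly positive excess kurtosis on the real series would confirm the visual impression of fat tails.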

  2. Correlation Between Features
In [ ]:
plt.figure(figsize=(8, 6))
sns.heatmap(df_scaled[feature_cols].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap of Input Features")
plt.show()
[Figure: Correlation Heatmap of Input Features]

This heatmap shows Pearson correlations between engineered features. MACD and its signal are highly correlated as expected. RSI shows moderate correlation with trend indicators. Volume is weakly inversely related to other features and may contribute to detecting volatility regimes.

  3. Effect of Feature Scaling
In [ ]:
plt.figure(figsize=(10, 4))
plt.plot(df["RSI"].values, label="Raw RSI", alpha=0.5)
plt.plot(df_scaled["RSI"].values, label="Scaled RSI", alpha=0.8)
plt.legend()
plt.title("Raw vs Scaled RSI")
plt.xlabel("Time Steps")
plt.ylabel("RSI Value")
plt.show()
[Figure: Raw vs Scaled RSI]

The plot compares the original RSI values with their normalized counterparts using standardization. Although the scale is compressed, the overall trend and pattern are preserved. This confirms that scaling does not distort signal information but helps stabilize neural network training.
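This preservation is exact, not approximate: standardization is an affine transformation with positive slope, so the Pearson correlation between the raw and scaled series is 1. A quick check on a synthetic stand-in for the RSI series:

```python
import numpy as np

rng = np.random.default_rng(42)
raw = rng.normal(50, 10, size=500)   # stand-in for a raw RSI series

# StandardScaler computes z = (x - mean) / std, an affine map with a
# positive slope, so the series' shape is unchanged
scaled = (raw - raw.mean()) / raw.std()

corr = np.corrcoef(raw, scaled)[0, 1]
print(f"correlation raw vs scaled: {corr:.6f}")
```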

  4. Log Returns Over Time
In [ ]:
plt.figure(figsize=(12, 4))
plt.plot(df.index, df["LogReturn"], label="Log Return")
plt.axhline(0, linestyle='--', color='gray', alpha=0.6)
plt.title("Daily Log Returns Over Time")
plt.xlabel("Date")
plt.ylabel("Log Return")
plt.grid(True)
plt.legend()
plt.show()
[Figure: Daily Log Returns Over Time]

This plot shows the daily log returns over the entire dataset period. A large spike in early 2020 corresponds to the COVID-19 market crash and rebound. This confirms the presence of high-volatility regimes and validates the importance of including such periods for robustness testing of forecasting models.

IV. Sliding Window Construction¶

  • Target: next-day (scaled) log return
  • Input: N-day sequences of the scaled features (N = 30 below)

Create Sliding Window Dataset

In [ ]:
import numpy as np

def create_sliding_window(X, y, window_size=30):
    Xs, ys = [], []
    for i in range(len(X) - window_size):
        Xs.append(X[i:i + window_size])
        ys.append(y[i + window_size])
    return np.array(Xs), np.array(ys)
In [ ]:
# Define input and output
features = df_scaled[feature_cols].values
target = df_scaled["LogReturn"].values

# Create sliding window dataset
X, y = create_sliding_window(features, target, window_size=30)

# Check shapes
print("X shape:", X.shape)  # (samples, window_size, features)
print("y shape:", y.shape)  # (samples,)
X shape: (2472, 30, 5)
y shape: (2472,)

V. Data Splitting (Walk-Forward Split)¶

In [ ]:
# 70% train, 15% val, 15% test
n = len(X)
train_end = int(n * 0.7)
val_end = int(n * 0.85)

X_train, y_train = X[:train_end], y[:train_end]
X_val, y_val = X[train_end:val_end], y[train_end:val_end]
X_test, y_test = X[val_end:], y[val_end:]
In [ ]:
print("Train:", X_train.shape, y_train.shape)
print("Val:  ", X_val.shape, y_val.shape)
print("Test: ", X_test.shape, y_test.shape)
Train: (1730, 30, 5) (1730,)
Val:   (371, 30, 5) (371,)
Test:  (371, 30, 5) (371,)

Walk-Forward Split Strategy¶

For time series forecasting, I use a walk-forward split instead of random shuffling. This preserves the temporal order of the data and mimics real-world forecasting, where the model only has access to past observations.

Why Use Walk-Forward Splitting?¶

  • Prevents lookahead bias
  • Ensures model evaluation simulates true deployment
  • Maintains temporal causality (training on past, testing on future)
  • Allows assessment of robustness across market regimes (e.g., COVID-2020 → post-COVID recovery → inflation era)

Our Split:¶

  • Training Set (70%): 2015 – early 2022 (approx.)
  • Validation Set (15%): early 2022 – mid-2023 (approx.)
  • Test Set (15%): mid-2023 – end of 2024 (approx.)
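The single chronological split used here could later be extended to full rolling-origin evaluation, retraining on an expanding window of past data. A sketch with scikit-learn's TimeSeriesSplit (illustrative, not part of the pipeline above):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-ins for the windowed arrays built earlier
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Each fold trains on an expanding past window and evaluates on the
# block immediately after it, so no future observation leaks backwards
tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    assert train_idx.max() < test_idx.min()   # temporal causality holds
    print(f"fold {fold}: train ends at {train_idx.max()}, "
          f"test covers {test_idx.min()}..{test_idx.max()}")
```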

VI. Baseline Model: LSTM for Log Return Forecasting¶

  1. Define the LSTM Architecture

This model implements a single-layer LSTM network with 64 hidden units, followed by a ReLU activation and two fully connected layers. The model outputs a single value representing the next day's log return. This forms the baseline for comparison with GRU and Transformer models in later sections.

In [ ]:
import torch
import torch.nn as nn

class LSTMRegressor(nn.Module):
    def __init__(self, input_size, hidden_size=64, num_layers=1, dropout=0.5):
        super(LSTMRegressor, self).__init__()

        # LSTM layer; note that nn.LSTM applies dropout only between
        # stacked layers, so with num_layers=1 this dropout is inactive
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                            batch_first=True, dropout=dropout)

        # Fully connected layers for regression
        self.fc1 = nn.Linear(hidden_size, 32)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(32, 1)  # Predict a single log return

    def forward(self, x):
        out, _ = self.lstm(x)              # out shape: [batch_size, seq_len, hidden_size]
        out = out[:, -1, :]                # Select the output of the last time step
        out = self.relu(self.fc1(out))
        return self.fc2(out)

Instantiate

In [ ]:
input_size = X_train.shape[2]  # 5 features
model = LSTMRegressor(input_size)
  2. Loss Function and Optimizer

The model uses Mean Squared Error (MSE) as the loss function for log return regression. The Adam optimizer is used for training.

In [ ]:
import torch.optim as optim

# Select device (GPU if available); used by all models below
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move model to device
model = model.to(device)

# Loss function: MSE for regression
criterion = nn.MSELoss()

# Optimizer: Adam
optimizer = optim.Adam(model.parameters(), lr=1e-3)
  3. Train and Validate

Convert to PyTorch Dataset and DataLoader

In [ ]:
from torch.utils.data import TensorDataset, DataLoader

# Convert to tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)

X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.float32)

# Create datasets and loaders
train_ds = TensorDataset(X_train_tensor, y_train_tensor)
val_ds = TensorDataset(X_val_tensor, y_val_tensor)

train_loader = DataLoader(train_ds, batch_size=32, shuffle=False)
val_loader = DataLoader(val_ds, batch_size=32, shuffle=False)

The training and validation sets are converted into PyTorch TensorDataset objects and served through DataLoaders. Shuffling is disabled to preserve temporal order during training. Each sample is a 30-day sequence of features, with the target being the next day's log return.

Define Training Loop

In [ ]:
def train_model(model, train_loader, val_loader, criterion, optimizer, epochs=50):
    model.to(device)
    train_losses, val_losses = [], []

    for epoch in range(epochs):
        model.train()
        epoch_train_loss = 0.0

        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            preds = model(xb).squeeze()
            loss = criterion(preds, yb)
            loss.backward()
            optimizer.step()
            epoch_train_loss += loss.item()

        train_loss = epoch_train_loss / len(train_loader)
        train_losses.append(train_loss)

        # Evaluate on validation set
        model.eval()
        with torch.no_grad():
            val_loss = sum(
                criterion(model(xb.to(device)).squeeze(), yb.to(device)).item()
                for xb, yb in val_loader
            ) / len(val_loader)
        val_losses.append(val_loss)

        print(f"Epoch {epoch+1}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")

    return model, train_losses, val_losses

The model is trained using a custom loop over mini-batches. Mean Squared Error (MSE) is computed for both training and validation sets at each epoch. Temporal order is preserved by not shuffling data during loading.

Train the Model

In [ ]:
# Train the model using the function
trained_model, train_losses, val_losses = train_model(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    criterion=criterion,
    optimizer=optimizer,
    epochs=50
)
Epoch 1/50 | Train Loss: 1.0350 | Val Loss: 1.4027
Epoch 2/50 | Train Loss: 1.0276 | Val Loss: 1.4016
Epoch 3/50 | Train Loss: 1.0203 | Val Loss: 1.4030
Epoch 4/50 | Train Loss: 1.0081 | Val Loss: 1.4492
Epoch 5/50 | Train Loss: 0.9924 | Val Loss: 1.4233
Epoch 6/50 | Train Loss: 0.9669 | Val Loss: 1.7221
Epoch 7/50 | Train Loss: 0.9521 | Val Loss: 1.9126
Epoch 8/50 | Train Loss: 0.9561 | Val Loss: 1.4753
Epoch 9/50 | Train Loss: 0.9230 | Val Loss: 1.8217
Epoch 10/50 | Train Loss: 0.9473 | Val Loss: 1.4530
Epoch 11/50 | Train Loss: 0.9300 | Val Loss: 1.5386
Epoch 12/50 | Train Loss: 0.8864 | Val Loss: 1.4918
Epoch 13/50 | Train Loss: 0.8589 | Val Loss: 1.6523
Epoch 14/50 | Train Loss: 0.8362 | Val Loss: 1.5123
Epoch 15/50 | Train Loss: 0.8177 | Val Loss: 1.6076
Epoch 16/50 | Train Loss: 0.7897 | Val Loss: 1.5215
Epoch 17/50 | Train Loss: 0.8083 | Val Loss: 1.6851
Epoch 18/50 | Train Loss: 0.7934 | Val Loss: 1.5412
Epoch 19/50 | Train Loss: 0.8214 | Val Loss: 1.4825
Epoch 20/50 | Train Loss: 0.7626 | Val Loss: 1.6418
Epoch 21/50 | Train Loss: 0.7114 | Val Loss: 1.6559
Epoch 22/50 | Train Loss: 0.6930 | Val Loss: 1.6667
Epoch 23/50 | Train Loss: 0.6866 | Val Loss: 1.7436
Epoch 24/50 | Train Loss: 0.7091 | Val Loss: 1.6224
Epoch 25/50 | Train Loss: 0.7244 | Val Loss: 1.7596
Epoch 26/50 | Train Loss: 0.7122 | Val Loss: 1.7439
Epoch 27/50 | Train Loss: 0.6772 | Val Loss: 1.7833
Epoch 28/50 | Train Loss: 0.6334 | Val Loss: 1.8134
Epoch 29/50 | Train Loss: 0.6106 | Val Loss: 1.8462
Epoch 30/50 | Train Loss: 0.6007 | Val Loss: 1.8630
Epoch 31/50 | Train Loss: 0.5858 | Val Loss: 1.9117
Epoch 32/50 | Train Loss: 0.5890 | Val Loss: 1.9265
Epoch 33/50 | Train Loss: 0.5757 | Val Loss: 1.9046
Epoch 34/50 | Train Loss: 0.5908 | Val Loss: 1.8915
Epoch 35/50 | Train Loss: 0.5654 | Val Loss: 1.8402
Epoch 36/50 | Train Loss: 0.5509 | Val Loss: 2.1095
Epoch 37/50 | Train Loss: 0.5517 | Val Loss: 1.9208
Epoch 38/50 | Train Loss: 0.5725 | Val Loss: 2.0798
Epoch 39/50 | Train Loss: 0.5726 | Val Loss: 2.1146
Epoch 40/50 | Train Loss: 0.5821 | Val Loss: 2.0669
Epoch 41/50 | Train Loss: 0.5534 | Val Loss: 1.9426
Epoch 42/50 | Train Loss: 0.5267 | Val Loss: 1.9810
Epoch 43/50 | Train Loss: 0.5070 | Val Loss: 2.0834
Epoch 44/50 | Train Loss: 0.4996 | Val Loss: 2.1690
Epoch 45/50 | Train Loss: 0.4972 | Val Loss: 2.0050
Epoch 46/50 | Train Loss: 0.5094 | Val Loss: 1.9477
Epoch 47/50 | Train Loss: 0.5071 | Val Loss: 2.1182
Epoch 48/50 | Train Loss: 0.4767 | Val Loss: 2.1653
Epoch 49/50 | Train Loss: 0.4636 | Val Loss: 2.2164
Epoch 50/50 | Train Loss: 0.4780 | Val Loss: 2.1549

Save training details

In [ ]:
import pickle

# Save training history
with open("lstm_training_history.pkl", "wb") as f:
    pickle.dump({"train_losses": train_losses, "val_losses": val_losses}, f)

Plot Loss Curves

In [ ]:
import matplotlib.pyplot as plt

plt.plot(train_losses, label="Train Loss")
plt.plot(val_losses, label="Val Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("Training and Validation Loss Over Epochs")
plt.legend()
plt.grid(True)
plt.show()
[Figure: Training and Validation Loss Over Epochs]

This plot shows the mean squared error (MSE) loss over 50 epochs for the baseline LSTM model. While the training loss steadily decreases, the validation loss begins to rise after approximately 20 epochs, indicating early signs of overfitting. This suggests that the model is learning the training data well but may be losing generalization capability over time.

In future phases, techniques such as early stopping, dropout tuning, or learning rate scheduling may help mitigate overfitting and improve performance on unseen data.
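As an illustration of the first of these, the patience logic behind early stopping can be sketched in isolation; the loss values below are taken from the first epochs of the validation curve above:

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (best_epoch, best_loss) under a simple patience rule:
    stop once validation loss has failed to improve for `patience`
    consecutive epochs. Inside train_model, one would also snapshot
    the model's state_dict whenever a new best is found."""
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch, best_loss

# On this curve, training would halt near the early minimum instead
# of running all 50 epochs
curve = [1.4027, 1.4016, 1.4030, 1.4492, 1.4233, 1.7221, 1.9126, 1.4753, 1.8217]
print(early_stopping_epoch(curve))   # -> (1, 1.4016)
```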

  4. Evaluate on Test Set

Prepare Test Data

In [ ]:
# Convert test data to tensors
X_test_tensor = torch.tensor(X_test, dtype=torch.float32).to(device)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).to(device)

Predict and Compare

In [ ]:
# Put model in eval mode
model.eval()

# Disable gradient tracking
with torch.no_grad():
    y_pred_tensor = model(X_test_tensor).squeeze()

# Convert to cpu numpy arrays
y_true = y_test_tensor.detach().cpu().numpy()
y_pred_lstm = y_pred_tensor.detach().cpu().numpy()

Compute Metrics

In [ ]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

mse = mean_squared_error(y_true, y_pred_lstm)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred_lstm)

print(f"LSTM Test MSE : {mse:.4f}")
print(f"LSTM Test RMSE: {rmse:.4f}")
print(f"LSTM Test MAE : {mae:.4f}")
LSTM Test MSE : 0.5669
LSTM Test RMSE: 0.7529
LSTM Test MAE : 0.5677

The baseline LSTM model was evaluated on the final 15% of the dataset using MSE, RMSE, and MAE metrics:

  • Test MSE: 0.5669
  • Test RMSE: 0.7529
  • Test MAE: 0.5677

Plot Predictions vs Actual

After training, the model was evaluated on a held-out test set representing the most recent portion of the time series. Metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) were computed. The plot below shows predicted vs actual log returns to visualize model performance.

In [ ]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
plt.plot(y_true, label="Actual", alpha=0.7)
plt.plot(y_pred_lstm, label="Predicted", alpha=0.7)
plt.title("Predicted vs Actual Log Returns on Test Set")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.show()
[Figure: Predicted vs Actual Log Returns on Test Set]

The plot shows predicted vs actual daily log returns. The model captures the general trend but underestimates extreme values, a common limitation of non-probabilistic baselines. These results will serve as a benchmark for more advanced models and uncertainty-aware extensions in future phases.

  5. Save the Weights
In [ ]:
# Save model weights to a .pt file
torch.save(model.state_dict(), "project_weights_lstm_baseline.pt")

VII. Baseline Model: GRU for Log Return Forecasting¶

To complement the LSTM baseline, I implemented a GRU model using the same architecture and hyperparameters. GRUs are computationally more efficient than LSTMs while retaining the ability to model short- to medium-term dependencies. Evaluation metrics are computed on the same test set to provide a fair comparison with the LSTM model.

  1. GRU Model Definition
In [ ]:
import torch.nn as nn

class GRURegressor(nn.Module):
    def __init__(self, input_size, hidden_size=64, num_layers=1, dropout=0.2):
        super(GRURegressor, self).__init__()

        # GRU layer; as with nn.LSTM, dropout applies only between
        # stacked layers, so with num_layers=1 it is inactive
        self.gru = nn.GRU(input_size, hidden_size, num_layers,
                          batch_first=True, dropout=dropout)

        # Fully connected layers for regression
        self.fc1 = nn.Linear(hidden_size, 32)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(32, 1)

    def forward(self, x):
        out, _ = self.gru(x)                # out shape: [batch_size, seq_len, hidden_size]
        out = out[:, -1, :]                 # Select the output of the last time step
        out = self.relu(self.fc1(out))
        return self.fc2(out)
  2. Instantiate the GRU Model and Optimizer
In [ ]:
# Instantiate GRU model
gru_model = GRURegressor(input_size=X_train.shape[2]).to(device)


# Define loss and optimizer
criterion = nn.MSELoss()
optimizer_gru = torch.optim.Adam(gru_model.parameters(), lr=1e-3)
  3. Train the GRU Model
In [ ]:
gru_model, gru_train_losses, gru_val_losses = train_model(
    model=gru_model,
    train_loader=train_loader,
    val_loader=val_loader,
    criterion=criterion,
    optimizer=optimizer_gru,
    epochs=50
)
Epoch 1/50 | Train Loss: 1.0347 | Val Loss: 1.4047
Epoch 2/50 | Train Loss: 1.0236 | Val Loss: 1.4122
Epoch 3/50 | Train Loss: 1.0134 | Val Loss: 1.4159
Epoch 4/50 | Train Loss: 1.0006 | Val Loss: 1.4433
Epoch 5/50 | Train Loss: 0.9824 | Val Loss: 1.5100
Epoch 6/50 | Train Loss: 0.9623 | Val Loss: 1.5791
Epoch 7/50 | Train Loss: 0.9435 | Val Loss: 1.6951
Epoch 8/50 | Train Loss: 0.9243 | Val Loss: 1.5223
Epoch 9/50 | Train Loss: 0.9030 | Val Loss: 1.5249
Epoch 10/50 | Train Loss: 0.8895 | Val Loss: 1.9763
Epoch 11/50 | Train Loss: 0.8954 | Val Loss: 1.5582
Epoch 12/50 | Train Loss: 0.8797 | Val Loss: 1.5629
Epoch 13/50 | Train Loss: 0.8479 | Val Loss: 1.5804
Epoch 14/50 | Train Loss: 0.8101 | Val Loss: 1.6208
Epoch 15/50 | Train Loss: 0.7693 | Val Loss: 1.7468
Epoch 16/50 | Train Loss: 0.7546 | Val Loss: 1.5480
Epoch 17/50 | Train Loss: 0.7884 | Val Loss: 1.8249
Epoch 18/50 | Train Loss: 0.7483 | Val Loss: 1.6710
Epoch 19/50 | Train Loss: 0.7880 | Val Loss: 1.5944
Epoch 20/50 | Train Loss: 0.7130 | Val Loss: 1.7566
Epoch 21/50 | Train Loss: 0.6660 | Val Loss: 1.6715
Epoch 22/50 | Train Loss: 0.6542 | Val Loss: 1.7973
Epoch 23/50 | Train Loss: 0.6446 | Val Loss: 1.6629
Epoch 24/50 | Train Loss: 0.6484 | Val Loss: 1.8345
Epoch 25/50 | Train Loss: 0.6391 | Val Loss: 1.6601
Epoch 26/50 | Train Loss: 0.6400 | Val Loss: 1.8181
Epoch 27/50 | Train Loss: 0.6175 | Val Loss: 1.7298
Epoch 28/50 | Train Loss: 0.6035 | Val Loss: 1.7988
Epoch 29/50 | Train Loss: 0.5898 | Val Loss: 1.7125
Epoch 30/50 | Train Loss: 0.5833 | Val Loss: 1.8202
Epoch 31/50 | Train Loss: 0.5756 | Val Loss: 1.6678
Epoch 32/50 | Train Loss: 0.5718 | Val Loss: 1.9482
Epoch 33/50 | Train Loss: 0.5709 | Val Loss: 1.5979
Epoch 34/50 | Train Loss: 0.5901 | Val Loss: 2.1256
Epoch 35/50 | Train Loss: 0.5756 | Val Loss: 1.6334
Epoch 36/50 | Train Loss: 0.5657 | Val Loss: 1.9053
Epoch 37/50 | Train Loss: 0.5437 | Val Loss: 1.7698
Epoch 38/50 | Train Loss: 0.5303 | Val Loss: 1.7931
Epoch 39/50 | Train Loss: 0.5379 | Val Loss: 1.7863
Epoch 40/50 | Train Loss: 0.5214 | Val Loss: 1.8299
Epoch 41/50 | Train Loss: 0.5198 | Val Loss: 1.9728
Epoch 42/50 | Train Loss: 0.5124 | Val Loss: 1.9133
Epoch 43/50 | Train Loss: 0.5156 | Val Loss: 2.0756
Epoch 44/50 | Train Loss: 0.5169 | Val Loss: 1.8781
Epoch 45/50 | Train Loss: 0.5161 | Val Loss: 2.0113
Epoch 46/50 | Train Loss: 0.4915 | Val Loss: 1.9908
Epoch 47/50 | Train Loss: 0.4860 | Val Loss: 2.1369
Epoch 48/50 | Train Loss: 0.4755 | Val Loss: 2.0962
Epoch 49/50 | Train Loss: 0.4742 | Val Loss: 2.2750
Epoch 50/50 | Train Loss: 0.4851 | Val Loss: 2.1362

Save training details

In [ ]:
import pickle

# Save training history
with open("gru_training_history.pkl", "wb") as f:
    pickle.dump({"train_losses": gru_train_losses, "val_losses": gru_val_losses}, f)

Plot Loss Curves

In [ ]:
import matplotlib.pyplot as plt

plt.plot(gru_train_losses, label="Train Loss")
plt.plot(gru_val_losses, label="Val Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("Training and Validation Loss Over Epochs")
plt.legend()
plt.grid(True)
plt.show()
[Figure: GRU Training and Validation Loss Over Epochs]

The GRU model also shows a clear decline in training loss over time, demonstrating its ability to fit the training data. However, the validation loss remains consistently higher and more volatile, with noticeable spikes from around epoch 10, suggesting it generalizes less well than the LSTM.

The increased variance in the validation curve may indicate sensitivity to specific patterns or a lack of capacity to model more complex dependencies. Further improvements could be explored through tuning dropout rates, adjusting hidden units, or applying regularization techniques.
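One such regularization tweak, sketched below, is adding an L2 penalty through Adam's weight_decay argument; the coefficient here is an illustrative starting point, not a tuned value:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same GRU layer shape as the baseline, with L2 regularization added
# via Adam's weight_decay (1e-4 is an illustrative guess, not tuned)
gru = nn.GRU(input_size=5, hidden_size=64, batch_first=True)
optimizer = torch.optim.Adam(gru.parameters(), lr=1e-3, weight_decay=1e-4)

# weight_decay shrinks parameters slightly each step, penalizing the
# large weights that often accompany overfitting
x = torch.randn(8, 30, 5)          # (batch, seq_len, features)
out, h = gru(x)
print(out.shape)                   # torch.Size([8, 30, 64])
```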

  4. Evaluate GRU on Test Set

Predict and Compare

In [ ]:
# Put model in eval mode
gru_model.eval()

# Disable gradient tracking
with torch.no_grad():
    y_pred_tensor = gru_model(X_test_tensor).squeeze()

# Convert to cpu numpy arrays
y_true = y_test_tensor.cpu().numpy()
y_pred_gru = y_pred_tensor.cpu().numpy()

Compute Metrics

In [ ]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

mse = mean_squared_error(y_true, y_pred_gru)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred_gru)

print(f"GRU Test MSE : {mse:.4f}")
print(f"GRU Test RMSE: {rmse:.4f}")
print(f"GRU Test MAE : {mae:.4f}")
GRU Test MSE : 0.6792
GRU Test RMSE: 0.8241
GRU Test MAE : 0.6231

The GRU model was trained using the same architecture, window size, and hyperparameters as the LSTM baseline to ensure a fair comparison. Below are the evaluation results on the same test set:

  • GRU Test MSE : 0.6792
  • GRU Test RMSE: 0.8241
  • GRU Test MAE : 0.6231

For reference, the LSTM test metrics were:

  • LSTM Test MSE: 0.5669
  • LSTM Test RMSE: 0.7529
  • LSTM Test MAE: 0.5677

The LSTM slightly outperformed the GRU across all evaluation metrics, suggesting it may be better suited to capturing the short-term dependencies in this dataset. The GRU still performed reasonably well and serves as a valid second baseline.

Plot Predictions vs Actual

In [ ]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
plt.plot(y_true, label="Actual", alpha=0.7)
plt.plot(y_pred_gru, label="Predicted", alpha=0.7)
plt.title("Predicted vs Actual Log Returns on Test Set")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.show()
[Figure: Predicted vs Actual Log Returns on Test Set (GRU)]

The plot above compares the GRU model’s predicted log returns against actual values on the test set. While the GRU captures overall trends and stays relatively stable, it tends to underfit high-volatility spikes and sharp directional movements. This mirrors the behavior seen with the LSTM, though GRU exhibits slightly larger deviation in volatile regions.

Overall, the GRU provides a reasonable forecast baseline, but the LSTM shows slightly stronger alignment with real movements. This visual comparison, alongside quantitative metrics, reinforces the decision to use LSTM as the primary architecture for further tuning and uncertainty modeling.

Save the weights

In [ ]:
# Save model weights to a .pt file
torch.save(gru_model.state_dict(), "project_weights_gru_baseline.pt")

Side-by-Side Plot: LSTM vs GRU vs Actual

In [ ]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 5))
plt.plot(y_true, label="Actual", linewidth=1)
plt.plot(y_pred_lstm, label="LSTM Predicted", alpha=0.8)
plt.plot(y_pred_gru, label="GRU Predicted", alpha=0.8)
plt.title("Actual vs LSTM vs GRU Log Returns on Test Set")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
[Figure: Actual vs LSTM vs GRU log returns on the test set]

This plot overlays the predictions of both the LSTM and GRU models against the actual log returns on the test set. Visually, we observe that:

  • LSTM tends to follow the direction and magnitude of large movements more closely
  • GRU captures trend direction but exhibits slightly more smoothing
  • Both models struggle with extreme spikes, which are inherently difficult to forecast due to noise and market shocks

This aligns well with our earlier quantitative results, where LSTM had lower RMSE and MAE. These visual and numeric insights guide us to select LSTM as the stronger baseline for further uncertainty modeling and risk estimation.


LSTM & GRU¶


I. Modularize Data Preparation¶

In [ ]:
def prepare_data(df_scaled, feature_cols, target_col="LogReturn", seq_len=30, split_ratio=(0.7, 0.15, 0.15)):
    """Window the scaled features and split chronologically into train/val/test sets."""
    # Extract input features and target
    X_raw = df_scaled[feature_cols].values
    y_raw = df_scaled[target_col].values

    # Create sliding windows
    X_seq, y_seq = create_sliding_window(X_raw, y_raw, window_size=seq_len)

    # Split
    n = len(X_seq)
    train_end = int(n * split_ratio[0])
    val_end = int(n * (split_ratio[0] + split_ratio[1]))

    X_train, y_train = X_seq[:train_end], y_seq[:train_end]
    X_val, y_val     = X_seq[train_end:val_end], y_seq[train_end:val_end]
    X_test, y_test   = X_seq[val_end:], y_seq[val_end:]

    return X_train, y_train, X_val, y_val, X_test, y_test
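`create_sliding_window` is defined earlier in the notebook; for reference, a minimal sketch of the windowing behavior `prepare_data` assumes (each sample is the preceding `window_size` rows of features, and the target is the value immediately after the window):

```python
import numpy as np

def create_sliding_window(X, y, window_size=30):
    """Build (num_windows, window_size, num_features) inputs and aligned targets."""
    X_seq, y_seq = [], []
    for i in range(window_size, len(X)):
        X_seq.append(X[i - window_size:i])  # features for the preceding window
        y_seq.append(y[i])                  # target immediately after the window
    return np.array(X_seq), np.array(y_seq)

# Tiny shape check: 10 rows, 3 features, window of 4 -> 6 windows
X = np.arange(30).reshape(10, 3).astype(float)
y = np.arange(10).astype(float)
X_seq, y_seq = create_sliding_window(X, y, window_size=4)
print(X_seq.shape, y_seq.shape)  # (6, 4, 3) (6,)
```

This is a sketch of the assumed behavior, not the notebook's actual definition.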

II. Modularize Data Loaders¶

In [ ]:
from torch.utils.data import TensorDataset, DataLoader

def get_data_loaders(X_train, y_train, X_val, y_val, batch_size=32):
    """Wrap training and validation arrays into PyTorch DataLoaders."""

    X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
    y_train_tensor = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
    X_val_tensor   = torch.tensor(X_val, dtype=torch.float32)
    y_val_tensor   = torch.tensor(y_val, dtype=torch.float32).unsqueeze(1)

    train_ds = TensorDataset(X_train_tensor, y_train_tensor)
    val_ds   = TensorDataset(X_val_tensor, y_val_tensor)

    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=False)  # shuffle=False preserves chronological order
    val_loader   = DataLoader(val_ds, batch_size=batch_size, shuffle=False)

    return train_loader, val_loader

LSTM¶


I. Training Loop for LSTM & GRU (Modularized)¶

In [ ]:
import time
from torch.utils.tensorboard import SummaryWriter

def train_model(
    model,
    train_loader,
    val_loader,
    criterion,
    optimizer,
    epochs=50,
    device=device,
    verbose=True,
    log_to_tensorboard=True,
    config_name=None
):
    model.to(device)
    train_losses, val_losses = [], []

    # TensorBoard writer
    writer = None
    if log_to_tensorboard:
        tag = config_name or f"{model.__class__.__name__}_{int(time.time())}"
        writer = SummaryWriter(log_dir=f"runs/{tag}")

    for epoch in range(epochs):
        model.train()
        epoch_train_loss = 0.0

        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()

            preds = model(xb)
            loss = criterion(preds, yb)
            loss.backward()
            optimizer.step()
            epoch_train_loss += loss.item()

        train_loss = epoch_train_loss / len(train_loader)
        train_losses.append(train_loss)

        # Validation
        model.eval()
        with torch.no_grad():
            val_loss = sum(
                criterion(model(xb.to(device)), yb.to(device)).item()
                for xb, yb in val_loader
            ) / len(val_loader)
        val_losses.append(val_loss)

        if verbose:
            print(f"Epoch {epoch+1}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")

        if writer:
            writer.add_scalar("Loss/Train", train_loss, epoch)
            writer.add_scalar("Loss/Val", val_loss, epoch)

    if writer:
        writer.close()

    return model, train_losses, val_losses

II. Find the Best Model by Grid Search¶

1. Define a Hyperparameter Grid

In [ ]:
import itertools

param_grid = {
    'hidden_size': [32, 64],
    'dropout': [0.2, 0.3],
    'lr': [1e-3, 5e-4],
    'seq_len': [30, 60],
    'num_layers': [2, 3]
}

def generate_configs(grid):
    keys = grid.keys()
    for values in itertools.product(*grid.values()):
        yield dict(zip(keys, values))
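With five hyperparameters at two values each, the generator yields 2^5 = 32 configurations, matching the 32 runs reported below. A quick self-contained check:

```python
import itertools

param_grid = {
    'hidden_size': [32, 64],
    'dropout': [0.2, 0.3],
    'lr': [1e-3, 5e-4],
    'seq_len': [30, 60],
    'num_layers': [2, 3]
}

def generate_configs(grid):
    keys = grid.keys()
    for values in itertools.product(*grid.values()):
        yield dict(zip(keys, values))

configs = list(generate_configs(param_grid))
print(len(configs))  # 32
print(configs[0])    # first config uses the first-listed value of every parameter
```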

2. LSTM: Hyperparameter Search Loop

2.1. Evaluation function

In [ ]:
def evaluate_model(model, data_loader, criterion=nn.MSELoss(), device=device, return_predictions=False):

    model.eval()
    model.to(device)

    preds, targets = [], []

    with torch.no_grad():
        for xb, yb in data_loader:
            xb, yb = xb.to(device), yb.to(device)
            pred = model(xb)
            preds.append(pred.cpu())
            targets.append(yb.cpu())

    preds = torch.cat(preds).squeeze()
    targets = torch.cat(targets).squeeze()

    mse = torch.mean((preds - targets) ** 2).item()
    rmse = np.sqrt(mse)
    mae = torch.mean(torch.abs(preds - targets)).item()

    if return_predictions:
        return mse, rmse, mae, preds.numpy(), targets.numpy()
    else:
        return mse, rmse, mae

2.2. Training

In [ ]:
results = []

for config in generate_configs(param_grid):
    print(f"Running config: {config}")

    # Prepare data based on seq_len
    X_train, y_train, X_val, y_val, X_test, y_test = prepare_data(
        df_scaled, feature_cols, seq_len=config['seq_len']
    )
    train_loader, val_loader = get_data_loaders(
        X_train, y_train, X_val, y_val, batch_size=32
    )


    # Model
    model = LSTMRegressor(
        input_size=X_train.shape[2],
        hidden_size=config['hidden_size'],
        dropout=config['dropout'],
        num_layers=config['num_layers']  # varied per grid configuration
    )

    # Optimizer, criterion
    optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'])
    criterion = nn.MSELoss()

    # Train
    model, train_losses, val_losses = train_model(
        model, train_loader, val_loader,
        criterion=criterion,
        optimizer=optimizer,
        epochs=50,
        verbose=False,
        log_to_tensorboard=True,
        config_name=f"LSTM_h{config['hidden_size']}_nl{config['num_layers']}_sl{config['seq_len']}_lr{config['lr']}"
    )


    # Evaluate on the validation set (reuse the loader helper; the second loader is discarded)
    val_loader_only, _ = get_data_loaders(X_val, y_val, X_val, y_val)
    _, rmse, mae = evaluate_model(model, val_loader_only)

    results.append((config, rmse, mae))
    print(f"RMSE: {rmse:.4f} | MAE: {mae:.4f}\n")
Running config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 2}
RMSE: 1.2739 | MAE: 0.9764

Running config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3}
RMSE: 1.4027 | MAE: 1.0179

Running config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 2}
RMSE: 1.3532 | MAE: 1.0162

Running config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3}
RMSE: 1.2236 | MAE: 0.9321

Running config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 2}
RMSE: 1.2990 | MAE: 1.0089

Running config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3}
RMSE: 1.2495 | MAE: 0.9605

Running config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2}
RMSE: 1.6032 | MAE: 1.1632

Running config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3}
RMSE: 1.2177 | MAE: 0.9370

Running config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 30, 'num_layers': 2}
RMSE: 1.4782 | MAE: 1.0958

Running config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3}
RMSE: 1.2438 | MAE: 0.9436

Running config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 2}
RMSE: 1.3286 | MAE: 1.0067

Running config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3}
RMSE: 1.3198 | MAE: 0.9679

Running config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 2}
RMSE: 1.3366 | MAE: 1.0381

Running config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3}
RMSE: 1.2445 | MAE: 0.9563

Running config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2}
RMSE: 1.1990 | MAE: 0.9233

Running config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3}
RMSE: 1.2087 | MAE: 0.9304

Running config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 2}
RMSE: 1.3202 | MAE: 1.0232

Running config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3}
RMSE: 1.2567 | MAE: 0.9782

Running config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 2}
RMSE: 1.2624 | MAE: 0.9647

Running config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3}
RMSE: 1.1957 | MAE: 0.9222

Running config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 2}
RMSE: 1.2610 | MAE: 0.9838

Running config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3}
RMSE: 1.4576 | MAE: 1.0505

Running config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2}
RMSE: 1.3276 | MAE: 0.9915

Running config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3}
RMSE: 1.2283 | MAE: 0.9391

Running config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 30, 'num_layers': 2}
RMSE: 1.2612 | MAE: 0.9836

Running config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3}
RMSE: 1.2422 | MAE: 0.9523

Running config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 2}
RMSE: 1.4392 | MAE: 1.0741

Running config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3}
RMSE: 1.2831 | MAE: 0.9619

Running config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 2}
RMSE: 1.3363 | MAE: 1.0200

Running config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3}
RMSE: 1.4134 | MAE: 1.0198

Running config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2}
RMSE: 1.3152 | MAE: 0.9970

Running config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3}
RMSE: 1.2753 | MAE: 0.9631

We performed a grid search over 32 LSTM configurations, varying hidden size, dropout, learning rate, sequence length, and number of layers. This systematically evaluated performance across setups and identified the best-performing model by RMSE and MAE. This tuning step was crucial for ensuring the LSTM baseline was both competitive and well-calibrated before comparing it with the GRU and Transformer architectures.

2.3. TensorBoard Visualization

In [ ]:
%load_ext tensorboard
%tensorboard --logdir=runs

2.4. Rank Top Configs (by Validation RMSE)

In [ ]:
results.sort(key=lambda x: x[1])

print("Top 5 LSTM Configurations (by Validation RMSE):\n")
for i, (config, rmse, mae) in enumerate(results[:5]):
    print(f"{i+1}. {config} | RMSE: {rmse:.4f} | MAE: {mae:.4f}")
Top 5 LSTM Configurations (by Validation RMSE):

1. {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3} | RMSE: 1.1957 | MAE: 0.9222
2. {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2} | RMSE: 1.1990 | MAE: 0.9233
3. {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3} | RMSE: 1.2087 | MAE: 0.9304
4. {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3} | RMSE: 1.2177 | MAE: 0.9370
5. {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3} | RMSE: 1.2236 | MAE: 0.9321

III. Evaluate Top Configs on Test Set¶

In [ ]:
best_config = results[0][0]  # Top config

# Re-prepare data with the correct seq_len
X_train, y_train, X_val, y_val, X_test, y_test = prepare_data(
    df_scaled, feature_cols, seq_len=best_config['seq_len']
)
train_loader, val_loader = get_data_loaders(X_train, y_train, X_val, y_val, batch_size=32)
# Reuse the helper to build a test loader; the duplicate second loader is discarded
test_loader, _ = get_data_loaders(X_test, y_test, X_test, y_test, batch_size=32)

model = LSTMRegressor(
    input_size=X_train.shape[2],
    hidden_size=best_config['hidden_size'],
    dropout=best_config['dropout'],
    num_layers=best_config['num_layers']
)
optimizer = torch.optim.Adam(model.parameters(), lr=best_config['lr'])
criterion = nn.MSELoss()

# Retrain with the best configuration, keeping the loss history for saving later
model, train_losses, val_losses = train_model(model, train_loader, val_loader, criterion, optimizer, epochs=50, verbose=False)

# Evaluate on test
mse, rmse, mae = evaluate_model(model, test_loader)
print(f"\nFinal Test Results — Best LSTM Config:")
print(f"LSTM Test MSE : {mse:.4f}")
print(f"LSTM Test RMSE: {rmse:.4f}")
print(f"LSTM Test MAE : {mae:.4f}")
Final Test Results — Best LSTM Config:
LSTM Test MSE : 0.4784
LSTM Test RMSE: 0.6917
LSTM Test MAE : 0.5207

After the grid search over 32 LSTM configurations, we identified a best-performing model with significantly improved results. Compared to the baseline LSTM, which achieved a test RMSE of 0.7529 and MAE of 0.5677, the tuned LSTM reduced the RMSE to 0.6917 and the MAE to 0.5207. This improvement demonstrates the value of systematic hyperparameter optimization for financial time-series forecasting.
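The gain can also be expressed as a relative improvement over the baseline; a quick arithmetic check using the reported values:

```python
# Baseline vs tuned LSTM test metrics (copied from the reported outputs)
baseline = {"rmse": 0.7529, "mae": 0.5677}
tuned    = {"rmse": 0.6917, "mae": 0.5207}

for k in ("rmse", "mae"):
    rel = (baseline[k] - tuned[k]) / baseline[k]
    print(f"{k.upper()} improvement: {rel:.1%}")
```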

IV. Saving the experiment¶

In [ ]:
import os
import torch
import json
import pickle
import numpy as np
import pandas as pd

def save_experiment(
    model,                    # trained model
    config,                   # best_config dict
    train_losses=None,
    val_losses=None,
    y_true=None,
    y_pred=None,
    output_dir="experiment_lstm_tuned",
    model_filename="project_weights_lstm_tuned.pt"
):
    os.makedirs(output_dir, exist_ok=True)

    # Save model weights
    model_path = os.path.join(output_dir, model_filename)
    torch.save(model.state_dict(), model_path)

    # Save config
    config_path = os.path.join(output_dir, "best_config.json")
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)

    # Save training history
    if train_losses is not None and val_losses is not None:
        history_path = os.path.join(output_dir, "training_history.pkl")
        with open(history_path, "wb") as f:
            pickle.dump({"train_losses": train_losses, "val_losses": val_losses}, f)

    # Save predictions
    if y_true is not None and y_pred is not None:
        df_preds = pd.DataFrame({
            "Actual": np.array(y_true),
            "Predicted": np.array(y_pred)
        })
        df_preds.to_csv(os.path.join(output_dir, "test_predictions.csv"), index=False)

    print(f"Experiment saved to: {output_dir}")

Saving the experiment

In [ ]:
# Predict on the test set with the tuned model so the saved predictions match it
model.eval()
with torch.no_grad():
    y_pred_tensor = model(torch.tensor(X_test, dtype=torch.float32).to(device)).squeeze()

y_true = y_test  # already a numpy array
y_pred_lstm = y_pred_tensor.cpu().numpy()

save_experiment(
    model=model,
    config=best_config,
    train_losses=train_losses,
    val_losses=val_losses,
    y_true=y_true,
    y_pred=y_pred_lstm,
    output_dir="experiment_lstm_tuned"
)
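To restore a run later, the config can be reloaded from the saved JSON and the weights via `torch.load`. Below is a minimal, self-contained sketch of the config round-trip, using a temporary directory rather than the real `experiment_lstm_tuned` folder:

```python
import json
import os
import tempfile

# Hypothetical config matching the grid-search fields used above
best_config = {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3}

out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "best_config.json")
with open(path, "w") as f:
    json.dump(best_config, f, indent=4)  # same format save_experiment writes

with open(path) as f:
    restored = json.load(f)

print(restored == best_config)  # True

# Weights would be restored analogously:
#   model.load_state_dict(torch.load("experiment_lstm_tuned/project_weights_lstm_tuned.pt"))
```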

V. Plots¶

1. Train vs Validation Loss Curve

In [ ]:
import pickle
import matplotlib.pyplot as plt

# Load training history
with open("experiment_lstm_tuned/training_history.pkl", "rb") as f:
    history = pickle.load(f)
    train_losses = history["train_losses"]
    val_losses = history["val_losses"]

# Plot
plt.figure(figsize=(8, 4))
plt.plot(train_losses, label="Train Loss")
plt.plot(val_losses, label="Val Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("LSTM: Training vs Validation Loss")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
[Figure: LSTM training vs validation loss]

The plot shows a steady decline in training loss, while the validation loss remains relatively flat with high variance. This suggests that while the model is learning on the training data, it may be struggling to generalize, potentially due to overfitting or high variance in the validation set. Techniques like early stopping or regularization could help stabilize performance further.
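Early stopping, mentioned above, can be added to the training loop by tracking the best validation loss with a patience counter. A minimal sketch of the stopping rule (hypothetical; the runs above trained a fixed 50 epochs):

```python
def should_stop(val_losses, patience=5, min_delta=1e-4):
    """Return True once the best validation loss hasn't improved for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses[:-patience])          # best loss before the patience window
    recent_best = min(val_losses[-patience:])   # best loss within the window
    return recent_best > best - min_delta

# Synthetic curve: improves for 5 epochs, then plateaus above the best value
losses = [1.0, 0.8, 0.7, 0.65, 0.64] + [0.66] * 6
print(should_stop(losses[:6]))  # False: still within patience of the last improvement
print(should_stop(losses))      # True: no improvement over the last 5 epochs
```

In `train_model`, this check would run once per epoch after appending to `val_losses`, breaking out of the loop when it returns True.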

2. Predicted vs Actual (Line Plot)

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

# Load LSTM predictions
df_preds_lstm = pd.read_csv("experiment_lstm_tuned/test_predictions.csv")

# Plot
plt.figure(figsize=(10, 4))
plt.plot(df_preds_lstm["Actual"], label="Actual", alpha=0.7)
plt.plot(df_preds_lstm["Predicted"], label="Predicted", alpha=0.7)
plt.title("LSTM: Predicted vs Actual Log Returns")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
[Figure: LSTM predicted vs actual log returns]

The plot shows that while the predicted values capture the overall trend and central tendency of the actual returns, they tend to smooth out the extreme fluctuations. This is typical in regression-based models, where the focus is on minimizing average error rather than capturing rare, high-volatility events.

3. Scatter Plot: Actual vs Predicted

In [ ]:
plt.figure(figsize=(6, 6))
plt.scatter(df_preds_lstm["Actual"], df_preds_lstm["Predicted"], alpha=0.5, color='steelblue')
plt.plot([-2, 2], [-2, 2], color='gray', linestyle='--')  # Identity line
plt.title("LSTM: Actual vs Predicted Scatter")
plt.xlabel("Actual Log Return")
plt.ylabel("Predicted Log Return")
plt.grid(True)
plt.axis("equal")
plt.tight_layout()
plt.show()
[Figure: LSTM actual vs predicted scatter]

While the predictions are generally centered around zero and follow the correct trend, they cluster tightly, indicating the model tends to underestimate extreme movements. The scatter around the diagonal suggests reasonable correlation but limited responsiveness to higher volatility, a common challenge in financial return modeling.


GRU¶


I. Find the Best Model by Grid Search¶

1. Define a Hyperparameter Grid

In [ ]:
param_grid_gru = {
    'hidden_size': [32, 64],
    'dropout': [0.2, 0.3],
    'lr': [1e-3, 5e-4],
    'seq_len': [30, 60],
    'num_layers': [2, 3]
}

2. GRU: Hyperparameter Search Loop

In [ ]:
results_gru = []

for config in generate_configs(param_grid_gru):
    print(f"Running GRU config: {config}")

    # Prepare data with the given sequence length
    X_train, y_train, X_val, y_val, X_test, y_test = prepare_data(
        df_scaled, feature_cols, seq_len=config['seq_len']
    )
    train_loader, val_loader = get_data_loaders(X_train, y_train, X_val, y_val, batch_size=32)

    # Instantiate GRU model
    model = GRURegressor(
        input_size=X_train.shape[2],
        hidden_size=config['hidden_size'],
        dropout=config['dropout'],
        num_layers=config['num_layers']
    )

    optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'])
    criterion = nn.MSELoss()

    # Train GRU model
    model, train_losses, val_losses = train_model(
        model, train_loader, val_loader,
        criterion=criterion,
        optimizer=optimizer,
        epochs=50,
        verbose=False,
        log_to_tensorboard=True,
        config_name=f"GRU_h{config['hidden_size']}_nl{config['num_layers']}_sl{config['seq_len']}_lr{config['lr']}"
    )

    # Evaluate on the validation set (reuse the loader helper; the second loader is discarded)
    val_loader_only, _ = get_data_loaders(X_val, y_val, X_val, y_val)
    _, rmse, mae = evaluate_model(model, val_loader_only)

    results_gru.append((config, rmse, mae))
    print(f"GRU RMSE: {rmse:.4f} | MAE: {mae:.4f}\n")
Running GRU config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 2}
GRU RMSE: 1.3826 | MAE: 1.0499

Running GRU config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3}
GRU RMSE: 1.2991 | MAE: 1.0041

Running GRU config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 2}
GRU RMSE: 1.5463 | MAE: 1.1125

Running GRU config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3}
GRU RMSE: 1.2201 | MAE: 0.9406

Running GRU config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 2}
GRU RMSE: 1.2816 | MAE: 0.9862

Running GRU config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3}
GRU RMSE: 1.6661 | MAE: 1.1922

Running GRU config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2}
GRU RMSE: 1.7062 | MAE: 1.2154

Running GRU config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3}
GRU RMSE: 1.2622 | MAE: 0.9552

Running GRU config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 30, 'num_layers': 2}
GRU RMSE: 1.3020 | MAE: 0.9947

Running GRU config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3}
GRU RMSE: 1.3118 | MAE: 0.9860

Running GRU config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 2}
GRU RMSE: 1.4116 | MAE: 1.0473

Running GRU config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3}
GRU RMSE: 1.2075 | MAE: 0.9313

Running GRU config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 2}
GRU RMSE: 1.3214 | MAE: 1.0105

Running GRU config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3}
GRU RMSE: 1.2536 | MAE: 0.9644

Running GRU config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2}
GRU RMSE: 1.3265 | MAE: 1.0065

Running GRU config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3}
GRU RMSE: 1.2024 | MAE: 0.9287

Running GRU config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 2}
GRU RMSE: 1.5041 | MAE: 1.1170

Running GRU config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3}
GRU RMSE: 1.2533 | MAE: 0.9717

Running GRU config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 2}
GRU RMSE: 1.2930 | MAE: 0.9761

Running GRU config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3}
GRU RMSE: 1.4640 | MAE: 1.0619

Running GRU config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 2}
GRU RMSE: 1.3986 | MAE: 1.0541

Running GRU config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3}
GRU RMSE: 1.3829 | MAE: 1.0449

Running GRU config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2}
GRU RMSE: 1.4323 | MAE: 1.0535

Running GRU config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3}
GRU RMSE: 1.4907 | MAE: 1.0778

Running GRU config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 30, 'num_layers': 2}
GRU RMSE: 1.2652 | MAE: 0.9760

Running GRU config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3}
GRU RMSE: 1.2893 | MAE: 0.9983

Running GRU config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 2}
GRU RMSE: 1.3142 | MAE: 0.9940

Running GRU config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3}
GRU RMSE: 1.3776 | MAE: 1.0213

Running GRU config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 2}
GRU RMSE: 1.3987 | MAE: 1.0555

Running GRU config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3}
GRU RMSE: 1.2457 | MAE: 0.9615

Running GRU config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2}
GRU RMSE: 1.3005 | MAE: 0.9880

Running GRU config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3}
GRU RMSE: 1.4227 | MAE: 1.0452

As with the LSTM, we performed an exhaustive grid search over 32 GRU hyperparameter combinations, varying hidden size, dropout, learning rate, sequence length, and number of layers.

2.3. TensorBoard Visualization

In [ ]:
%reload_ext tensorboard
%tensorboard --logdir=runs

2.4. Rank Top Configs (by Validation RMSE)

In [ ]:
results_gru.sort(key=lambda x: x[1])
for i, (config, rmse, mae) in enumerate(results_gru[:5]):
    print(f"{i+1}. {config} | RMSE: {rmse:.4f} | MAE: {mae:.4f}")
1. {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3} | RMSE: 1.2024 | MAE: 0.9287
2. {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3} | RMSE: 1.2075 | MAE: 0.9313
3. {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3} | RMSE: 1.2201 | MAE: 0.9406
4. {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3} | RMSE: 1.2457 | MAE: 0.9615
5. {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3} | RMSE: 1.2533 | MAE: 0.9717

III. Evaluate Top Configs on Test Set¶

In [ ]:
# Pick best config
best_gru_config = results_gru[0][0]  # Top config

# Prepare data using best seq_len
X_train, y_train, X_val, y_val, X_test, y_test = prepare_data(
    df_scaled, feature_cols, seq_len=best_gru_config['seq_len']
)
train_loader, val_loader = get_data_loaders(X_train, y_train, X_val, y_val, batch_size=32)
# Reuse the helper to build a test loader; the duplicate second loader is discarded
test_loader, _ = get_data_loaders(X_test, y_test, X_test, y_test, batch_size=32)

# Build GRU model
model_gru = GRURegressor(
    input_size=X_train.shape[2],
    hidden_size=best_gru_config['hidden_size'],
    dropout=best_gru_config['dropout'],
    num_layers=best_gru_config['num_layers']
)

# Optimizer and criterion
optimizer = torch.optim.Adam(model_gru.parameters(), lr=best_gru_config['lr'])
criterion = nn.MSELoss()

# Retrain with the best configuration (training set, validated on the validation set)
model_gru, gru_train_losses, gru_val_losses = train_model(
    model_gru, train_loader, val_loader,
    criterion=criterion, optimizer=optimizer,
    epochs=50, verbose=False
)

# Evaluate on test
mse, rmse, mae = evaluate_model(model_gru, test_loader)
print(f"\nFinal Test Results — Best GRU Config:")
print(f"GRU Test MSE : {mse:.4f}")
print(f"GRU Test RMSE: {rmse:.4f}")
print(f"GRU Test MAE : {mae:.4f}")
Final Test Results — Best GRU Config:
GRU Test MSE : 0.4774
GRU Test RMSE: 0.6909
GRU Test MAE : 0.5197

The baseline GRU model yielded a test RMSE of 0.8241 and MAE of 0.6231. After performing grid search across 32 hyperparameter combinations, the best GRU configuration significantly improved performance, reducing the RMSE to 0.6909 and MAE to 0.5197. This result highlights the impact of systematic tuning in enhancing the GRU model’s ability to capture temporal patterns in financial log returns.

IV. Saving the experiment¶

In [ ]:
import os
import torch
import json
import pickle
import numpy as np
import pandas as pd

def save_experiment_gru(
    model,                    # trained GRU model
    config,                   # best_config dict
    train_losses=None,
    val_losses=None,
    y_true=None,
    y_pred=None,
    output_dir="experiment_gru_tuned",
    model_filename="project_weights_gru_tuned.pt"
):
    os.makedirs(output_dir, exist_ok=True)

    # Save model weights
    model_path = os.path.join(output_dir, model_filename)
    torch.save(model.state_dict(), model_path)

    # Save config
    config_path = os.path.join(output_dir, "best_config.json")
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)

    # Save training history
    if train_losses is not None and val_losses is not None:
        history_path = os.path.join(output_dir, "training_history.pkl")
        with open(history_path, "wb") as f:
            pickle.dump({"train_losses": train_losses, "val_losses": val_losses}, f)

    # Save predictions
    if y_true is not None and y_pred is not None:
        df_preds = pd.DataFrame({
            "Actual": np.array(y_true),
            "Predicted": np.array(y_pred)
        })
        df_preds.to_csv(os.path.join(output_dir, "test_predictions.csv"), index=False)

    print(f"GRU experiment saved to: {output_dir}")

Saving the experiment

In [ ]:
# Predict on test set
model_gru.eval()
with torch.no_grad():
    y_pred_tensor = model_gru(torch.tensor(X_test, dtype=torch.float32).to(device)).squeeze()

y_true = y_test  # already numpy
y_pred_gru = y_pred_tensor.cpu().numpy()

# Save everything
save_experiment_gru(
    model=model_gru,
    config=best_gru_config,
    train_losses=gru_train_losses,
    val_losses=gru_val_losses,
    y_true=y_true,
    y_pred=y_pred_gru,
    output_dir="experiment_gru_tuned",
    model_filename="project_weights_gru_tuned.pt"
)

V. Plots¶

1. Training vs Validation Loss Curve
In [ ]:
import matplotlib.pyplot as plt
import pickle

# Load training history
with open("experiment_gru_tuned/training_history.pkl", "rb") as f:
    history = pickle.load(f)
    gru_train_losses = history["train_losses"]
    gru_val_losses = history["val_losses"]

# Plot
plt.figure(figsize=(8, 4))
plt.plot(gru_train_losses, label="Train Loss")
plt.plot(gru_val_losses, label="Val Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("GRU: Training vs Validation Loss")
plt.legend()
plt.grid(True)
plt.show()
[Figure: GRU training vs validation loss]

The training loss shows a gradual decline, indicating the model is learning effectively on the training data. However, the validation loss exhibits noticeable variance and instability, suggesting potential overfitting or high sensitivity to noise in the validation set. This could potentially be improved with techniques like early stopping, regularization, or more robust validation splits.

2. Predicted vs Actual (Line Plot)
In [ ]:
import pandas as pd

# Load predictions
df_preds = pd.read_csv("experiment_gru_tuned/test_predictions.csv")

# Plot
plt.figure(figsize=(10, 4))
plt.plot(df_preds["Actual"], label="Actual", alpha=0.7)
plt.plot(df_preds["Predicted"], label="Predicted", alpha=0.7)
plt.title("GRU: Predicted vs Actual Log Returns")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.show()
[Figure: GRU predicted vs actual log returns]

While the GRU model captures the overall direction and trend centrality of the series, it underestimates the magnitude of rapid changes and high-volatility spikes. This results in smoother predicted values that track the general trend but miss sharp fluctuations, a common tradeoff in deep learning models trained with MSE-based objectives on noisy financial data.

3. Scatter Plot: Actual vs Predicted
In [ ]:
plt.figure(figsize=(6, 6))
plt.scatter(df_preds["Actual"], df_preds["Predicted"], alpha=0.5, color='orange')
plt.plot([-2, 2], [-2, 2], color='gray', linestyle='--')  # Identity line
plt.title("GRU: Actual vs Predicted Scatter")
plt.xlabel("Actual Log Return")
plt.ylabel("Predicted Log Return")
plt.grid(True)
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.gca().set_aspect("equal", adjustable="box")  # equal aspect without overriding the fixed limits
plt.show()
[Figure: GRU actual vs predicted scatter]

The predictions are tightly clustered around zero, indicating the GRU model tends to regress toward the mean and underestimates the magnitude of larger log returns. This conservative behavior is common in financial models trained to minimize MSE, especially when the data is noisy and exhibits heavy-tailed distributions.

Transformer¶


Transformer (Vanilla)¶


I. Define the Vanilla Transformer¶

In [ ]:
import torch
import torch.nn as nn
import math

# Positional encoding module for adding temporal information to input embeddings
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()

        # Create a matrix of shape (max_len, d_model) for positional encodings
        pe = torch.zeros(max_len, d_model)

        # Generate position indices (0 to max_len - 1) as a column vector
        position = torch.arange(0, max_len).unsqueeze(1)

        # Compute the denominator term for sine/cosine frequencies
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))

        # Apply sine to even indices in the embedding dimension
        pe[:, 0::2] = torch.sin(position * div_term)
        # Apply cosine to odd indices in the embedding dimension
        pe[:, 1::2] = torch.cos(position * div_term)
        self.pe = pe.unsqueeze(0) # Add a batch dimension (1, max_len, d_model)

    def forward(self, x):
        # Add positional encodings to the input tensor
        x = x + self.pe[:, :x.size(1)].to(x.device) # x: (batch_size, seq_len, d_model)
        return x



class TimeSeriesTransformer(nn.Module):
    def __init__(self, input_dim, model_dim=64, num_heads=4, num_layers=2, dropout=0.1):
        super().__init__()
        # Project raw input features into model dimension space
        self.input_proj = nn.Linear(input_dim, model_dim)
        # Add positional encoding to the projected inputs
        self.pos_encoder = PositionalEncoding(model_dim)

        # Define a single Transformer encoder layer
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=model_dim,
            nhead=num_heads,
            dim_feedforward=128,
            dropout=dropout,
            batch_first=True # Enable (batch, seq, feature) input format
        )

        # Stack multiple encoder layers to form the full Transformer encoder
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        # Output head: MLP to map final hidden state to a scalar prediction
        self.head = nn.Sequential(
            nn.Linear(model_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )

    def forward(self, x):
        x = self.input_proj(x)    # Project input to model dimension
        x = self.pos_encoder(x)   # Add positional encoding
        x = self.transformer(x)   # Pass through Transformer encoder
        out = x[:, -1, :]         # Use the last token's output as the representation for prediction
        return self.head(out)     # Predict the next value using the MLP head
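As a quick sanity check of the tensor shapes flowing through this architecture, the sketch below mirrors the model's pipeline with a stand-in encoder; the batch size, sequence length, and input dimension are illustrative values, not the notebook's actual data:

```python
import torch
import torch.nn as nn

# Illustrative dimensions mirroring the model's defaults
batch, seq_len, input_dim, model_dim = 8, 60, 6, 64
proj = nn.Linear(input_dim, model_dim)
layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=4,
                                   dim_feedforward=128, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(model_dim, 1)

x = torch.randn(batch, seq_len, input_dim)
out = head(encoder(proj(x))[:, -1, :])  # last-token representation -> scalar prediction
print(out.shape)  # torch.Size([8, 1])
```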

II. Training & Evaluation Setup¶

In [ ]:
import time
from torch.utils.tensorboard import SummaryWriter

# Training loop for a PyTorch model with TensorBoard logging
def train_model(
    model,
    train_loader,
    val_loader,
    criterion,
    optimizer,
    epochs=50,
    device=device,
    log_to_tensorboard=True,
    config_name="transformer_default",
    verbose=True
):
    model.to(device)
    train_losses, val_losses = [], []
    # Setup TensorBoard writer
    writer = SummaryWriter(log_dir=f"runs/{config_name}") if log_to_tensorboard else None

    for epoch in range(epochs):
        model.train()
        train_loss = 0.0

        # Training step
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            preds = model(xb)
            loss = criterion(preds, yb)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        train_loss /= len(train_loader)
        train_losses.append(train_loss)

        # Validation step
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for xb, yb in val_loader:
                xb, yb = xb.to(device), yb.to(device)
                preds = model(xb)
                loss = criterion(preds, yb)
                val_loss += loss.item()
        val_loss /= len(val_loader)
        val_losses.append(val_loss)

        # Logging and output
        if verbose:
            print(f"Epoch {epoch+1}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
        if writer:
            writer.add_scalar("Loss/Train", train_loss, epoch)
            writer.add_scalar("Loss/Val", val_loss, epoch)

    # Close TensorBoard writer
    if writer:
        writer.close()

    return model, train_losses, val_losses

III. Define evaluate_model() with Optional Predictions¶

In [ ]:
import numpy as np

# Evaluation function to compute MSE, RMSE, MAE (with predictions)
# Note: `criterion` is accepted for API symmetry but the metrics are computed directly below
def evaluate_model(model, data_loader, criterion=nn.MSELoss(), device=device, return_predictions=False):
    model.eval()
    model.to(device)

    preds, targets = [], []

    # Inference loop (no gradients)
    with torch.no_grad():
        for xb, yb in data_loader:
            xb, yb = xb.to(device), yb.to(device)
            pred = model(xb)
            preds.append(pred.cpu())
            targets.append(yb.cpu())

    # Concatenate all predictions and targets
    preds = torch.cat(preds).squeeze()
    targets = torch.cat(targets).squeeze()

    # Compute evaluation metrics
    mse = torch.mean((preds - targets) ** 2).item()
    rmse = np.sqrt(mse)
    mae = torch.mean(torch.abs(preds - targets)).item()

    # return raw predictions and targets
    if return_predictions:
        return mse, rmse, mae, preds.numpy(), targets.numpy()
    return mse, rmse, mae
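The metric arithmetic in evaluate_model() can be verified on a tiny hand-checkable case (the values below are illustrative, not model outputs):

```python
import torch
import numpy as np

preds = torch.tensor([1.0, 2.0, 3.0])
targets = torch.tensor([1.0, 1.0, 5.0])

# Same formulas as evaluate_model()
mse = torch.mean((preds - targets) ** 2).item()      # (0 + 1 + 4) / 3 = 1.6667
rmse = np.sqrt(mse)                                  # sqrt(5/3) = 1.2910
mae = torch.mean(torch.abs(preds - targets)).item()  # (0 + 1 + 2) / 3 = 1.0

print(mse, rmse, mae)
```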

IV. Run the Vanilla Transformer Training¶

In [ ]:
# Data
seq_len = 60
X_train, y_train, X_val, y_val, X_test, y_test = prepare_data(
    df_scaled, feature_cols, seq_len=seq_len
)
train_loader, val_loader = get_data_loaders(X_train, y_train, X_val, y_val, batch_size=32)
test_loader, _ = get_data_loaders(X_test, y_test, X_test, y_test, batch_size=32)

# Model
model = TimeSeriesTransformer(
    input_dim=input_size,
    model_dim=64,
    num_heads=4,
    num_layers=2,
    dropout=0.1
)


# Optimizer and Loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Train
model, train_losses, val_losses = train_model(
    model, train_loader, val_loader,
    criterion=criterion,
    optimizer=optimizer,
    epochs=50,
    log_to_tensorboard=True,
    config_name="transformer_baseline"
)

# Evaluate
mse, rmse, mae, y_pred, y_true = evaluate_model(model, test_loader, return_predictions=True)
print(f"\n[Transformer] Test MSE: {mse:.4f} | RMSE: {rmse:.4f} | MAE: {mae:.4f}")
Epoch 1/50 | Train Loss: 1.0560 | Val Loss: 1.4010
Epoch 2/50 | Train Loss: 1.0423 | Val Loss: 1.4353
Epoch 3/50 | Train Loss: 1.0285 | Val Loss: 1.4041
Epoch 4/50 | Train Loss: 1.0015 | Val Loss: 1.8611
Epoch 5/50 | Train Loss: 1.0466 | Val Loss: 1.3770
Epoch 6/50 | Train Loss: 1.0287 | Val Loss: 1.4237
Epoch 7/50 | Train Loss: 1.0182 | Val Loss: 1.5937
Epoch 8/50 | Train Loss: 1.0014 | Val Loss: 1.4345
Epoch 9/50 | Train Loss: 1.0035 | Val Loss: 1.4745
Epoch 10/50 | Train Loss: 0.9816 | Val Loss: 2.2235
Epoch 11/50 | Train Loss: 1.2082 | Val Loss: 1.5907
Epoch 12/50 | Train Loss: 1.0278 | Val Loss: 1.3937
Epoch 13/50 | Train Loss: 1.0224 | Val Loss: 1.4201
Epoch 14/50 | Train Loss: 0.9977 | Val Loss: 1.5446
Epoch 15/50 | Train Loss: 1.0044 | Val Loss: 1.5680
Epoch 16/50 | Train Loss: 0.9909 | Val Loss: 2.3362
Epoch 17/50 | Train Loss: 0.9856 | Val Loss: 1.4531
Epoch 18/50 | Train Loss: 0.9765 | Val Loss: 2.3087
Epoch 19/50 | Train Loss: 1.0040 | Val Loss: 1.3811
Epoch 20/50 | Train Loss: 1.0108 | Val Loss: 1.3955
Epoch 21/50 | Train Loss: 0.9922 | Val Loss: 1.6466
Epoch 22/50 | Train Loss: 0.9924 | Val Loss: 1.3715
Epoch 23/50 | Train Loss: 0.9860 | Val Loss: 1.4310
Epoch 24/50 | Train Loss: 0.9465 | Val Loss: 1.3904
Epoch 25/50 | Train Loss: 1.0154 | Val Loss: 1.5313
Epoch 26/50 | Train Loss: 1.0188 | Val Loss: 1.4463
Epoch 27/50 | Train Loss: 1.0072 | Val Loss: 1.3698
Epoch 28/50 | Train Loss: 1.0430 | Val Loss: 1.3724
Epoch 29/50 | Train Loss: 1.0412 | Val Loss: 1.3713
Epoch 30/50 | Train Loss: 1.0392 | Val Loss: 1.3764
Epoch 31/50 | Train Loss: 1.0349 | Val Loss: 1.3832
Epoch 32/50 | Train Loss: 1.0198 | Val Loss: 1.4785
Epoch 33/50 | Train Loss: 1.0165 | Val Loss: 1.3724
Epoch 34/50 | Train Loss: 1.0106 | Val Loss: 1.3792
Epoch 35/50 | Train Loss: 1.0008 | Val Loss: 1.4077
Epoch 36/50 | Train Loss: 0.9362 | Val Loss: 1.4173
Epoch 37/50 | Train Loss: 0.9250 | Val Loss: 1.3831
Epoch 38/50 | Train Loss: 0.9485 | Val Loss: 1.4550
Epoch 39/50 | Train Loss: 1.0614 | Val Loss: 1.3693
Epoch 40/50 | Train Loss: 1.0400 | Val Loss: 1.3721
Epoch 41/50 | Train Loss: 1.0382 | Val Loss: 1.3725
Epoch 42/50 | Train Loss: 1.0391 | Val Loss: 1.3731
Epoch 43/50 | Train Loss: 1.0357 | Val Loss: 1.3729
Epoch 44/50 | Train Loss: 1.0270 | Val Loss: 1.3948
Epoch 45/50 | Train Loss: 0.9587 | Val Loss: 1.4321
Epoch 46/50 | Train Loss: 0.9860 | Val Loss: 1.3714
Epoch 47/50 | Train Loss: 0.9215 | Val Loss: 2.2381
Epoch 48/50 | Train Loss: 1.0207 | Val Loss: 1.3849
Epoch 49/50 | Train Loss: 1.0562 | Val Loss: 1.3737
Epoch 50/50 | Train Loss: 1.0406 | Val Loss: 1.3745

[Transformer] Test MSE: 0.4800 | RMSE: 0.6929 | MAE: 0.5204

The vanilla Transformer model achieved a test RMSE of 0.6929 and MAE of 0.5204 after 50 epochs. Its performance is on par with the best-tuned LSTM and GRU models, indicating its ability to effectively capture temporal dependencies even without recurrence. This provides a strong baseline for exploring more advanced transformer-based architectures with enhanced temporal encoding and regularization.

V. Saving the Experiment¶

In [ ]:
import os
import torch
import json
import pickle
import numpy as np
import pandas as pd

def save_experiment(
    model,
    config,
    train_losses=None,
    val_losses=None,
    y_true=None,
    y_pred=None,
    output_dir="experiment_transformer_vanilla",
    model_filename="project_weights_transformer_vanilla.pt"
):
    os.makedirs(output_dir, exist_ok=True)

    # Save model weights
    model_path = os.path.join(output_dir, model_filename)
    torch.save(model.state_dict(), model_path)

    # Save config
    config_path = os.path.join(output_dir, "best_config.json")
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)

    # Save training history
    if train_losses is not None and val_losses is not None:
        history_path = os.path.join(output_dir, "training_history.pkl")
        with open(history_path, "wb") as f:
            pickle.dump({"train_losses": train_losses, "val_losses": val_losses}, f)

    # Save predictions
    if y_true is not None and y_pred is not None:
        df_preds = pd.DataFrame({
            "Actual": np.array(y_true),
            "Predicted": np.array(y_pred)
        })
        df_preds.to_csv(os.path.join(output_dir, "test_predictions.csv"), index=False)

    print(f"Transformer experiment saved to: {output_dir}")

Saving...

In [ ]:
# Define the config
transformer_config = {
    "model_dim": 64,
    "num_heads": 4,
    "num_layers": 2,
    "dropout": 0.1,
    "seq_len": 60,
    "lr": 0.001
}

# Save experiment
save_experiment(
    model=model,
    config=transformer_config,
    train_losses=train_losses,
    val_losses=val_losses,
    y_true=y_true,
    y_pred=y_pred,
    output_dir="experiment_transformer_vanilla",
    model_filename="project_weights_transformer_vanilla.pt"
)

VI. Plots¶

In [ ]:
import os
import torch
import json
import pickle
import pandas as pd
import matplotlib.pyplot as plt

# Define paths
exp_dir = "experiment_transformer_vanilla"
history_path = os.path.join(exp_dir, "training_history.pkl")
preds_path = os.path.join(exp_dir, "test_predictions.csv")

# Load training history
with open(history_path, "rb") as f:
    history = pickle.load(f)

train_losses = history["train_losses"]
val_losses = history["val_losses"]

# Load predictions
df_preds = pd.read_csv(preds_path)

1. TensorBoard Visualization

In [ ]:
%reload_ext tensorboard
%tensorboard --logdir=runs
Output hidden; open in https://colab.research.google.com to view.
  2. Plot Train vs. Validation Loss
In [ ]:
plt.figure(figsize=(8, 4))
plt.plot(train_losses, label="Train Loss")
plt.plot(val_losses, label="Val Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("Transformer: Training & Validation Loss")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

The training loss shows a stable downward trend, while the validation loss fluctuates significantly, indicating potential sensitivity to initialization or data noise. Despite this variance, the model converges to a competitive performance level, suggesting that the Transformer can generalize well with further tuning or regularization.

  3. Predicted vs Actual Log Returns (Line Plot)
In [ ]:
plt.figure(figsize=(10, 4))
plt.plot(df_preds["Actual"], label="Actual", alpha=0.7)
plt.plot(df_preds["Predicted"], label="Predicted", alpha=0.7)
plt.title("Transformer: Predicted vs Actual Log Returns")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.show()

The predicted values remain close to the zero line, indicating the Transformer struggles to capture the magnitude of volatility in the data. While it approximates the trend center well, it fails to react to large fluctuations, a common limitation when models are trained with MSE loss on noisy financial series. This reinforces the need for uncertainty-aware or regularized architectures.

  4. Scatter Plot (Actual vs Predicted)
In [ ]:
plt.figure(figsize=(6, 6))
plt.scatter(df_preds["Actual"], df_preds["Predicted"], alpha=0.5, color='green')
plt.plot([-2, 2], [-2, 2], color='gray', linestyle='--')
plt.title("Transformer: Actual vs Predicted Scatter")
plt.xlabel("Actual Log Return")
plt.ylabel("Predicted Log Return")
plt.gca().set_aspect("equal", adjustable="box")  # square aspect without overriding the fixed limits
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.grid(True)
plt.show()

The predictions are heavily clustered near zero, highlighting the model’s tendency to underreact to large deviations. This underdispersion suggests that while the Transformer captures the central trend well, it struggles to model high-volatility movements accurately, underscoring the importance of incorporating uncertainty estimation or variance-aware loss functions in future enhancements.


Transformer (Regularized)¶


I. Define the Transformer¶

In [ ]:
import torch
import torch.nn as nn
import math

# Positional encoding to inject temporal order into input embeddings
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=1000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.pe = pe.unsqueeze(0) # (1, max_len, d_model)

    def forward(self, x):
        return x + self.pe[:, :x.size(1)].to(x.device)

# Transformer model with dropout regularization and MC Dropout for uncertainty estimation
class TransformerRegularized(nn.Module):
    def __init__(
        self,
        input_dim,
        model_dim=64,
        num_heads=4,
        num_layers=3,
        dropout=0.2,
        ff_dim=128,
        mc_dropout=False  # if True, keeps dropout active during inference
    ):
        super().__init__()
        self.mc_dropout = mc_dropout  # enables dropout at inference

         # Input projection + normalization + regularization
        self.input_proj = nn.Sequential(
            nn.Linear(input_dim, model_dim),
            nn.LayerNorm(model_dim),
            nn.Dropout(dropout)
        )
        self.pos_encoder = PositionalEncoding(model_dim)

        # Transformer encoder with pre-layer normalization for better convergence
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=model_dim,
            nhead=num_heads,
            dim_feedforward=ff_dim,
            dropout=dropout,
            batch_first=True,
            norm_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        # Output regression head with additional LayerNorm and Dropout
        self.regressor = nn.Sequential(
            nn.LayerNorm(model_dim),
            nn.Linear(model_dim, 32),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(32, 1)
        )

    def forward(self, x):
        x = self.input_proj(x)
        x = self.pos_encoder(x)
        x = self.transformer(x)
        x = x[:, -1, :]  # use the final time step representation

         # Enable dropout during inference for MC Dropout sampling
        if self.mc_dropout:
            for m in self.regressor:
                if isinstance(m, nn.Dropout):
                    m.train()  # keep dropout active during inference

        return self.regressor(x)
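The mc_dropout mechanism can be illustrated in isolation: keeping a Dropout layer in train mode while the surrounding module is in eval mode makes repeated forward passes on the same input stochastic, which is what enables the uncertainty estimates used below. A minimal sketch (the layer sizes here are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Small stand-in for the regression head; only Dropout is re-enabled at inference
head = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Dropout(0.5), nn.Linear(32, 1))
head.eval()
for m in head:
    if isinstance(m, nn.Dropout):
        m.train()  # keep dropout active during inference (MC Dropout)

x = torch.randn(1, 16)
with torch.no_grad():
    samples = torch.stack([head(x) for _ in range(100)])  # 100 stochastic passes

print(samples.std().item() > 0)  # True: passes disagree, giving an uncertainty spread
```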

Monte Carlo Dropout Prediction Function

In [ ]:
import torch
import numpy as np

def predict_mc_dropout(model, data_loader, device, n_samples=100):
    model.eval()
    model.mc_dropout = True
    model.to(device)

    all_preds = []

    with torch.no_grad():
        for _ in range(n_samples):
            preds = []
            for xb, _ in data_loader:
                xb = xb.to(device)
                pred = model(xb)
                preds.append(pred.cpu())
            preds = torch.cat(preds, dim=0).squeeze().numpy()
            all_preds.append(preds)

    return np.array(all_preds)  # shape: [n_samples, n_points]

Compute VaR & Expected Shortfall

In [ ]:
import numpy as np

def compute_var_es_mc(predictions, alpha=0.05):

    mean = predictions.mean(axis=0)
    std = predictions.std(axis=0)

    # Compute VaR at each time step (percentile across samples)
    var = np.percentile(predictions, 100 * alpha, axis=0)

    # Compute ES per time step
    es = []
    for t in range(predictions.shape[1]):
        below_var = predictions[:, t][predictions[:, t] < var[t]]
        es_t = below_var.mean() if len(below_var) > 0 else var[t]  # fallback to VaR if no values below
        es.append(es_t)

    es = np.array(es)
    return mean, std, var, es
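The per-time-step logic can be checked on synthetic Monte Carlo samples; by construction the Expected Shortfall must be at least as extreme as the VaR at every time step (the sample counts below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic MC predictions: 1000 stochastic samples for 5 time steps
predictions = rng.normal(loc=0.0, scale=1.0, size=(1000, 5))

# 5% VaR per time step: the 5th percentile across MC samples
var = np.percentile(predictions, 5, axis=0)

# ES per time step: mean of samples falling below the VaR threshold
es = np.array([
    predictions[:, t][predictions[:, t] < var[t]].mean()
    for t in range(predictions.shape[1])
])

print(np.all(es < var))  # True: ES is always more negative than VaR
```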

II. Training & Evaluation Setup¶

In [ ]:
import torch
import time
from torch.utils.tensorboard import SummaryWriter

def train_model_with_early_stopping(
    model,
    train_loader,
    val_loader,
    criterion,
    optimizer,
    epochs=100,
    device=device,
    patience=10,
    log_to_tensorboard=True,
    config_name="transformer_regularized"
):
    model.to(device)
    best_val_loss = float('inf')
    best_model_state = None
    counter = 0

    train_losses, val_losses = [], []

    if log_to_tensorboard:
        writer = SummaryWriter(log_dir=f"runs/{config_name}")

    for epoch in range(1, epochs + 1):
        model.train()
        epoch_train_loss = 0
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            preds = model(xb)
            loss = criterion(preds, yb)
            loss.backward()
            optimizer.step()
            epoch_train_loss += loss.item() * xb.size(0)

        epoch_train_loss /= len(train_loader.dataset)
        train_losses.append(epoch_train_loss)

        # Validation
        model.eval()
        epoch_val_loss = 0
        with torch.no_grad():
            for xb, yb in val_loader:
                xb, yb = xb.to(device), yb.to(device)
                preds = model(xb)
                loss = criterion(preds, yb)
                epoch_val_loss += loss.item() * xb.size(0)

        epoch_val_loss /= len(val_loader.dataset)
        val_losses.append(epoch_val_loss)

        # Logging
        if log_to_tensorboard:
            writer.add_scalars(f"{config_name}/loss", {
                "Train": epoch_train_loss,
                "Val": epoch_val_loss
            }, epoch)

        print(f"Epoch {epoch:02d}/{epochs} | Train Loss: {epoch_train_loss:.4f} | Val Loss: {epoch_val_loss:.4f}")

        # Early stopping check
        if epoch_val_loss < best_val_loss:
            best_val_loss = epoch_val_loss
            best_model_state = model.state_dict()
            counter = 0
        else:
            counter += 1
            if counter >= patience:
                print(f"Early stopping at epoch {epoch}")
                break

    if log_to_tensorboard:
        writer.close()

    # Restore best model
    if best_model_state is not None:
        model.load_state_dict(best_model_state)

    return model, train_losses, val_losses

IV. Run the Transformer Training¶

In [ ]:
SEQ_LEN = 60

X_train, y_train, X_val, y_val, X_test, y_test = prepare_data(
    df_scaled, feature_cols, seq_len=SEQ_LEN
)

train_loader, val_loader = get_data_loaders(X_train, y_train, X_val, y_val, batch_size=32)
test_loader, _ = get_data_loaders(X_test, y_test, X_test, y_test, batch_size=32)


input_dim = X_train.shape[2]

model = TransformerRegularized(
    input_dim=input_dim,
    model_dim=128,
    num_heads=4,
    num_layers=4,
    dropout=0.1,
    ff_dim=256,
    mc_dropout=True
)

import torch.nn as nn

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)

model, train_losses, val_losses = train_model_with_early_stopping(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    criterion=criterion,
    optimizer=optimizer,
    epochs=100,
    patience=10,
    log_to_tensorboard=True,
    config_name="TransformerRegularized"
)

mse, rmse, mae, y_pred, y_true = evaluate_model(
    model, test_loader, criterion=nn.MSELoss(), return_predictions=True
)

print(f"\nFinal Test Results — TransformerRegularized:")
print(f"MSE : {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE : {mae:.4f}")
/usr/local/lib/python3.11/dist-packages/torch/nn/modules/transformer.py:385: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.norm_first was True
  warnings.warn(
Epoch 01/100 | Train Loss: 1.0498 | Val Loss: 1.4501
Epoch 02/100 | Train Loss: 1.0412 | Val Loss: 1.4330
Epoch 03/100 | Train Loss: 1.0323 | Val Loss: 1.5041
Epoch 04/100 | Train Loss: 1.0415 | Val Loss: 1.4737
Epoch 05/100 | Train Loss: 1.0337 | Val Loss: 1.4898
Epoch 06/100 | Train Loss: 1.0331 | Val Loss: 2.2935
Epoch 07/100 | Train Loss: 1.0383 | Val Loss: 1.4732
Epoch 08/100 | Train Loss: 1.0224 | Val Loss: 1.4272
Epoch 09/100 | Train Loss: 1.0367 | Val Loss: 1.4290
Epoch 10/100 | Train Loss: 1.0325 | Val Loss: 1.4468
Epoch 11/100 | Train Loss: 1.0189 | Val Loss: 1.4796
Epoch 12/100 | Train Loss: 1.0111 | Val Loss: 1.5031
Epoch 13/100 | Train Loss: 1.0042 | Val Loss: 1.4885
Epoch 14/100 | Train Loss: 1.0178 | Val Loss: 1.5225
Epoch 15/100 | Train Loss: 1.0110 | Val Loss: 1.4446
Epoch 16/100 | Train Loss: 0.9777 | Val Loss: 1.6107
Epoch 17/100 | Train Loss: 0.9817 | Val Loss: 1.7964
Epoch 18/100 | Train Loss: 1.0004 | Val Loss: 2.4265
Early stopping at epoch 18

Final Test Results — TransformerRegularized:
MSE : 0.4953
RMSE: 0.7038
MAE : 0.5318

The regularized Transformer with Monte Carlo Dropout achieved a test RMSE of 0.7038 and MAE of 0.5318. While slightly less accurate than the best-tuned GRU/LSTM and vanilla Transformer, it offers the added advantage of predictive uncertainty through stochastic forward passes. This trade-off between slight performance cost and richer model interpretability is valuable in financial forecasting, where confidence intervals and risk quantification are critical. The model serves as a strong foundation for further risk-aware extensions.

Compute VaR and ES

In [ ]:
mc_preds = predict_mc_dropout(model, test_loader, device, n_samples=100)
mean_pred, std_pred, var_95, es_95 = compute_var_es_mc(mc_preds, alpha=0.05)

Plotting

In [ ]:
plt.figure(figsize=(12, 5))
plt.plot(y_true, label="Actual", alpha=0.8)
plt.plot(mean_pred, label="Mean Prediction", color='orange')
plt.plot(var_95, label="VaR (95%)", color='red', linestyle='--')
plt.plot(es_95, label="ES (95%)", color='purple', linestyle='dashed')
plt.fill_between(range(len(mean_pred)), mean_pred - 2*std_pred, mean_pred + 2*std_pred, alpha=0.2, label="±2 std")
plt.title("MC Dropout: Mean Prediction, VaR, and Expected Shortfall")
plt.legend()
plt.grid(True)
plt.show()

This plot demonstrates the model’s ability to not only provide point forecasts but also meaningful confidence intervals. VaR and Expected Shortfall dynamically adjust based on model uncertainty, particularly in high-volatility regions. While the predictive mean is conservative and smoother than actual returns, the model offers valuable insight into potential downside risk, essential for risk-aware decision-making.

V. Save the Experiment¶

In [ ]:
save_experiment(
    model=model,
    config={"seq_len": SEQ_LEN, "model_dim": 128, "num_heads": 4, "num_layers": 4,
            "dropout": 0.1, "ff_dim": 256, "lr": 0.0005},
    train_losses=train_losses,
    val_losses=val_losses,
    y_true=y_true,
    y_pred=y_pred,
    output_dir="experiment_transformer_regularized",
    model_filename="project_weights_transformer_regularized.pt"
)
Final experiment saved to: experiment_transformer_regularized

VI. Plots¶

1. TensorBoard Visualization

In [ ]:
%reload_ext tensorboard
%tensorboard --logdir=runs
Output hidden; open in https://colab.research.google.com to view.

2. Plot: Training vs Validation Loss

In [ ]:
import pickle
import matplotlib.pyplot as plt

# Load training history
with open("experiment_transformer_regularized/training_history.pkl", "rb") as f:
    history = pickle.load(f)

train_losses = history["train_losses"]
val_losses = history["val_losses"]

# Plot
plt.figure(figsize=(8, 4))
plt.plot(train_losses, label="Train Loss")
plt.plot(val_losses, label="Val Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("Transformer (Regularized): Training & Validation Loss")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

The training loss decreases steadily, while the validation loss exhibits fluctuations, a typical pattern when dropout-based regularization is active. Early stopping was triggered at epoch 18 once the validation loss began increasing persistently, preventing further overfitting. This strategy preserved generalization and resulted in a final test RMSE of 0.7038, with added benefits of uncertainty-aware forecasting and risk quantification via VaR and Expected Shortfall.

3. Plot: Predicted vs Actual (Line Plot)

In [ ]:
import pandas as pd

# Load predictions
df_preds = pd.read_csv("experiment_transformer_regularized/test_predictions.csv")

# Plot
plt.figure(figsize=(10, 4))
plt.plot(df_preds["Actual"], label="Actual", alpha=0.7)
plt.plot(df_preds["Predicted"], label="Predicted", alpha=0.7)
plt.title("Transformer (Regularized): Predicted vs Actual Log Returns")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.show()

The predicted series closely follows the overall trend of the actual returns, especially in low-volatility periods. While extreme fluctuations are still underpredicted, the model demonstrates improved responsiveness compared to previous baselines. This smoothing is expected in probabilistic forecasts, where the mean prediction serves as a central tendency, and uncertainty is captured separately via predictive intervals. The result supports the model’s use in risk-aware forecasting rather than exact value prediction.

4. Plot: Scatter — Actual vs Predicted

In [ ]:
plt.figure(figsize=(6, 6))
plt.scatter(df_preds["Actual"], df_preds["Predicted"], alpha=0.5, color='purple')
plt.plot([-2, 2], [-2, 2], color='gray', linestyle='--')  # identity line
plt.title("Transformer (Regularized): Actual vs Predicted Scatter")
plt.xlabel("Actual Log Return")
plt.ylabel("Predicted Log Return")
plt.grid(True)
plt.axis("equal")
plt.tight_layout()
plt.show()

The predictions show a clear concentration around zero, consistent with a tendency to regress toward the mean. While there's reasonable alignment with the diagonal in moderate return ranges, the model underpredicts more extreme values, especially in the tails. This conservative pattern is expected in models optimized for risk-aware forecasting, where the goal is to capture distributional characteristics rather than individual spikes. The spread is tighter than in prior baselines, suggesting improved calibration.


Transformer (Final Architecture)¶


I. Model Architecture¶

Final Architecture: Patch-based Transformer with Monte Carlo Dropout¶

This model introduces a patch-based input strategy inspired by Vision Transformers (ViT) and recent time series architectures like PatchTST. It embeds non-overlapping time patches, applies positional encoding, and processes them with stacked Transformer encoder blocks. Monte Carlo Dropout is enabled during inference to support predictive uncertainty estimation and risk-aware forecasting.

In [ ]:
import torch
import torch.nn as nn
import math


# Patch-based input embedding inspired by Vision Transformers (ViT)
class PatchEmbedding(nn.Module):
    def __init__(self, input_dim, patch_len, model_dim):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(input_dim * patch_len, model_dim)

    def forward(self, x):
        # x: [B, T, D] => reshape into non-overlapping patches
        B, T, D = x.shape
        assert T % self.patch_len == 0, "Time series length must be divisible by patch_len"
        num_patches = T // self.patch_len
        x = x.view(B, num_patches, D * self.patch_len)
        return self.proj(x)


# Same as before, add position information to each token
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=500):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.pe = pe.unsqueeze(0)

    def forward(self, x):
        return x + self.pe[:, :x.size(1)].to(x.device)

# Patch-based Transformer with MC Dropout and global average pooling
class ForecastingTransformer(nn.Module):
    def __init__(
        self,
        input_dim,
        model_dim=128,
        patch_len=10,
        num_heads=4,
        num_layers=3,
        ff_dim=256,
        dropout=0.1,
        output_dim=1,
        mc_dropout=True
    ):
        super().__init__()
        self.mc_dropout = mc_dropout
        # Patchify time series and project into model_dim space
        self.embedding = PatchEmbedding(input_dim, patch_len, model_dim)
        # Add positional encoding to patches
        self.pos_encoding = PositionalEncoding(model_dim)

        # Transformer encoder with norm-first setting
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=model_dim,
            nhead=num_heads,
            dim_feedforward=ff_dim,
            dropout=dropout,
            batch_first=True,
            norm_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        # Output head with LayerNorm, Dropout, and MLP
        self.output_head = nn.Sequential(
            nn.LayerNorm(model_dim),
            nn.Dropout(dropout),
            nn.Linear(model_dim, 64),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(64, output_dim)
        )

    def forward(self, x):
        # x: [B, T, D] => [B, Num_Patches, Model_Dim]
        x = self.embedding(x)
        x = self.pos_encoding(x)
        x = self.transformer(x)

        # Global average pooling over patches instead of last token
        x = x.mean(dim=1)

        # Enable MC Dropout during inference
        if self.mc_dropout:
            for m in self.output_head:
                if isinstance(m, nn.Dropout):
                    m.train()
        return self.output_head(x)
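The forward pass reduces [B, T, D] to [B, T/patch_len, model_dim] before pooling. Here is a minimal, self-contained sketch of that shape flow; ToyPatchEmbedding is a hypothetical stand-in for the PatchEmbedding defined earlier in the notebook, not its actual implementation:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in: split [B, T, D] into non-overlapping patches of
# length patch_len, then project each flattened patch to model_dim.
class ToyPatchEmbedding(nn.Module):
    def __init__(self, input_dim, patch_len, model_dim):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(input_dim * patch_len, model_dim)

    def forward(self, x):
        B, T, D = x.shape
        x = x.reshape(B, T // self.patch_len, self.patch_len * D)
        return self.proj(x)

emb = ToyPatchEmbedding(input_dim=6, patch_len=10, model_dim=128)
x = torch.randn(2, 40, 6)       # [B, T, D] with seq_len = patch_len * 4
patches = emb(x)                # [2, 4, 128]: four patches per sequence
pooled = patches.mean(dim=1)    # [2, 128] after global average pooling
```

With seq_len=40 and patch_len=10, the transformer attends over only 4 tokens per sequence, which is the main efficiency argument for patching.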

mc_dropout_predict() Utility Function¶

In [ ]:
def mc_dropout_predict(model, loader, n_samples=50, device='cuda'):
    model.eval()
    model.to(device)

    preds_mc = []
    with torch.no_grad():
        for _ in range(n_samples):
            batch_preds = []
            for xb, _ in loader:
                xb = xb.to(device)
                pred = model(xb)  # Dropout is active due to mc_dropout flag
                batch_preds.append(pred.cpu())
            preds_mc.append(torch.cat(batch_preds).squeeze(1))

    return torch.stack(preds_mc)  # Shape: [n_samples, N]
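The key trick, keeping Dropout layers in train mode while the rest of the model stays in eval mode, can be demonstrated standalone. This is a sketch with a toy head, not the project model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
head = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Dropout(0.5), nn.Linear(8, 1))
head.eval()                      # eval mode would normally disable dropout...
for m in head.modules():
    if isinstance(m, nn.Dropout):
        m.train()                # ...so flip just the Dropout layers back to train mode

x = torch.randn(3, 4)
with torch.no_grad():
    # Repeated forward passes on the same input now give different outputs
    samples = torch.stack([head(x) for _ in range(50)])   # [50, 3, 1]

mean, std = samples.mean(dim=0), samples.std(dim=0)       # predictive mean and spread
```

The spread of `samples` across passes is what the risk metrics below treat as predictive uncertainty.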

Utility Function for VaR & ES¶

In [ ]:
import numpy as np

def compute_var_es(mc_samples, alpha=0.05):
    """Compute Value at Risk (VaR) and Expected Shortfall (ES) per time step
    from MC Dropout samples of shape [n_samples, N]."""
    preds = mc_samples.numpy()
    var = np.quantile(preds, alpha, axis=0)  # alpha-quantile across MC samples
    # ES: mean of the samples at or below the VaR, taken per time step.
    # (Boolean indexing flattens the array, so collect the tail column by column.)
    es = np.array([
        preds[preds[:, t] <= var[t], t].mean()
        for t in range(preds.shape[1])
    ])
    return var, es
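On synthetic Gaussian MC samples, the per-time-step computation looks like this (self-contained sketch; the arrays are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
preds = rng.normal(0.0, 0.01, size=(50, 5))  # 50 MC samples x 5 time steps

alpha = 0.05
var = np.quantile(preds, alpha, axis=0)      # 5% quantile per time step
# ES: mean of the samples at or below the VaR, taken per time step
es = np.array([preds[preds[:, t] <= var[t], t].mean() for t in range(preds.shape[1])])
```

By construction the ES is at least as extreme as the VaR at every time step, since it averages only the tail beyond the quantile.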

II. Training Setup¶

In [ ]:
from torch.utils.tensorboard import SummaryWriter
import time

def train_model(
    model,
    train_loader,
    val_loader,
    criterion,
    optimizer,
    device,
    epochs=50,
    patience=8,
    log_dir=None
):
    model.to(device)
    best_val_loss = float("inf")
    best_model_state = None
    counter = 0
    train_losses, val_losses = [], []

    writer = SummaryWriter(log_dir=log_dir) if log_dir else None

    for epoch in range(epochs):
        model.train()
        train_loss = 0.0
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            preds = model(xb)
            loss = criterion(preds, yb)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        train_loss /= len(train_loader)
        train_losses.append(train_loss)

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for xb, yb in val_loader:
                xb, yb = xb.to(device), yb.to(device)
                preds = model(xb)
                loss = criterion(preds, yb)
                val_loss += loss.item()
        val_loss /= len(val_loader)
        val_losses.append(val_loss)

        if writer:
            writer.add_scalar("Loss/Train", train_loss, epoch)
            writer.add_scalar("Loss/Val", val_loss, epoch)

        print(f"Epoch {epoch+1}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_model_state = model.state_dict()
            counter = 0
        else:
            counter += 1
            if counter >= patience:
                print(f"Early stopping triggered at epoch {epoch+1}")
                break

    if writer:
        writer.close()

    model.load_state_dict(best_model_state)
    return model, train_losses, val_losses
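The patience bookkeeping inside the loop can be isolated for clarity. `early_stop_epoch` is a hypothetical helper for illustration only, not part of the training code:

```python
def early_stop_epoch(val_losses, patience):
    """Return the 1-indexed epoch at which early stopping would trigger, or None."""
    best, counter = float("inf"), 0
    for epoch, v in enumerate(val_losses, start=1):
        if v < best:
            best, counter = v, 0   # improvement: reset the patience counter
        else:
            counter += 1
            if counter >= patience:
                return epoch
    return None

# Last improvement at epoch 2; stops after 3 consecutive non-improving epochs
print(early_stop_epoch([1.0, 0.9, 0.95, 0.96, 0.97], patience=3))  # → 5
```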

III. Evaluation Function¶

In [ ]:
import numpy as np
import torch
import torch.nn as nn

def evaluate_model(model, data_loader, criterion=nn.MSELoss(), device="cpu", return_predictions=False):
    model.eval()
    model.to(device)

    preds, targets = [], []

    with torch.no_grad():
        for xb, yb in data_loader:
            xb, yb = xb.to(device), yb.to(device)
            pred = model(xb)
            preds.append(pred.cpu())
            targets.append(yb.cpu())

    preds = torch.cat(preds).squeeze()
    targets = torch.cat(targets).squeeze()

    mse = torch.mean((preds - targets) ** 2).item()
    rmse = np.sqrt(mse)
    mae = torch.mean(torch.abs(preds - targets)).item()

    if return_predictions:
        return mse, rmse, mae, preds.numpy(), targets.numpy()
    return mse, rmse, mae
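The three metrics reduce to a few lines; a hand-checkable example with arbitrary values:

```python
import numpy as np
import torch

preds = torch.tensor([0.10, -0.20, 0.05])
targets = torch.tensor([0.00, -0.10, 0.10])

mse = torch.mean((preds - targets) ** 2).item()      # (0.01 + 0.01 + 0.0025) / 3 = 0.0075
rmse = float(np.sqrt(mse))                           # sqrt(0.0075) ≈ 0.0866
mae = torch.mean(torch.abs(preds - targets)).item()  # (0.10 + 0.10 + 0.05) / 3 ≈ 0.0833
```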

IV. Training the Final Architecture¶

In [ ]:
# Define config
final_config = {
    "model_dim": 128,
    "patch_len": 10,
    "num_heads": 8,
    "num_layers": 3,
    "ff_dim": 256,
    "dropout": 0.2,
    "lr": 0.0005,
    "batch_size": 32,
    "epochs": 100,
    "patience": 10
}

# Prepare data
X_train, y_train, X_val, y_val, X_test, y_test = prepare_data(
    df_scaled, feature_cols, seq_len=final_config["patch_len"] * 4
)
train_loader, val_loader = get_data_loaders(X_train, y_train, X_val, y_val, final_config["batch_size"])
test_loader, _ = get_data_loaders(X_test, y_test, X_test, y_test, final_config["batch_size"])  # reuse the helper; the second loader is unused

# Initialize model
input_dim = X_train.shape[2]
model = ForecastingTransformer(
    input_dim=input_dim,
    model_dim=final_config["model_dim"],
    patch_len=final_config["patch_len"],
    num_heads=final_config["num_heads"],
    num_layers=final_config["num_layers"],
    ff_dim=final_config["ff_dim"],
    dropout=final_config["dropout"]
)

optimizer = torch.optim.Adam(model.parameters(), lr=final_config["lr"])
criterion = nn.MSELoss()

# Train with early stopping
model, train_losses, val_losses = train_model(
    model,
    train_loader,
    val_loader,
    criterion,
    optimizer,
    device=device,
    epochs=final_config["epochs"],
    patience=final_config["patience"],
    log_dir="runs/final_transformer"
)

# Final evaluation
mse, rmse, mae, y_pred, y_true = evaluate_model(model, test_loader, return_predictions=True)
print(f"\nFinal Test Results — Transformer Final Architecture:")
print(f"MSE : {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE : {mae:.4f}")
/usr/local/lib/python3.11/dist-packages/torch/nn/modules/transformer.py:385: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.norm_first was True
  warnings.warn(
Epoch 1/100 | Train Loss: 1.0441 | Val Loss: 1.4191
Epoch 2/100 | Train Loss: 1.0381 | Val Loss: 1.4102
Epoch 3/100 | Train Loss: 1.0216 | Val Loss: 1.4131
Epoch 4/100 | Train Loss: 1.0185 | Val Loss: 1.4353
Epoch 5/100 | Train Loss: 1.0138 | Val Loss: 1.4025
Epoch 6/100 | Train Loss: 0.9872 | Val Loss: 1.4650
Epoch 7/100 | Train Loss: 1.0218 | Val Loss: 1.6516
Epoch 8/100 | Train Loss: 1.0210 | Val Loss: 1.4030
Epoch 9/100 | Train Loss: 0.9826 | Val Loss: 4.2792
Epoch 10/100 | Train Loss: 1.0239 | Val Loss: 1.5222
Epoch 11/100 | Train Loss: 1.1201 | Val Loss: 1.3962
Epoch 12/100 | Train Loss: 1.0525 | Val Loss: 1.3979
Epoch 13/100 | Train Loss: 1.0288 | Val Loss: 1.3999
Epoch 14/100 | Train Loss: 1.0297 | Val Loss: 1.4162
Epoch 15/100 | Train Loss: 1.0308 | Val Loss: 1.4040
Epoch 16/100 | Train Loss: 1.0288 | Val Loss: 1.4056
Epoch 17/100 | Train Loss: 1.0203 | Val Loss: 1.4045
Epoch 18/100 | Train Loss: 1.0162 | Val Loss: 1.4049
Epoch 19/100 | Train Loss: 1.0105 | Val Loss: 1.4502
Epoch 20/100 | Train Loss: 0.9962 | Val Loss: 1.4845
Epoch 21/100 | Train Loss: 0.9911 | Val Loss: 1.5015
Early stopping triggered at epoch 21

Final Test Results — Transformer Final Architecture:
MSE : 0.6567
RMSE: 0.8104
MAE : 0.6271

The training and validation losses remained relatively stable across epochs, with early stopping triggered at epoch 21 to prevent overfitting. Despite minor fluctuations and a sharp validation spike at epoch 9 (likely caused by a volatile batch or stochastic dropout sampling), the model recovered quickly, reflecting the robustness of the patch-based Transformer architecture. While it did not achieve the lowest error among all models, it produced a well-regularized, generalizable model with an RMSE of 0.8104 and MAE of 0.6271, a strong result given the added benefits of modularity, uncertainty estimation, and deployment readiness.

Run MC Sampling

In [ ]:
n_samples = 50  # Number of stochastic passes
mc_samples = mc_dropout_predict(model, test_loader, n_samples=n_samples)

Compute VaR and ES

In [ ]:
# Compute risk metrics at 95% confidence
alpha = 0.05
var_95, es_95 = compute_var_es(mc_samples, alpha=alpha)

Compute Pointwise Mean, Std, and True Values

In [ ]:
# Compute predictive mean and std
mean_preds = mc_samples.mean(dim=0)
std_preds = mc_samples.std(dim=0)

# Get ground truth
y_true_tensor = torch.cat([yb for _, yb in test_loader]).squeeze()

Save VaR and ES along with predictions:

In [ ]:
import os
import pandas as pd

# Ensure the output directory exists before writing
os.makedirs("experiment_transformer_final", exist_ok=True)

df_risk = pd.DataFrame({
    "Actual": y_true_tensor.numpy(),
    "Prediction": mean_preds.numpy(),
    "StdDev": std_preds.numpy(),
    f"VaR_{int((1-alpha)*100)}": var_95,
    f"ES_{int((1-alpha)*100)}": es_95
})
df_risk.to_csv("experiment_transformer_final/risk_metrics.csv", index=False)

Plot Uncertainty

In [ ]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(y_true_tensor, label="Actual")
plt.plot(mean_preds, label="Mean Prediction")
plt.fill_between(
    range(len(mean_preds)),
    mean_preds - 2 * std_preds,
    mean_preds + 2 * std_preds,
    color='green', alpha=0.3,
    label="±2 std (uncertainty)"
)
plt.title("Transformer Final: MC Dropout Prediction with Uncertainty")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

The orange line shows the mean predicted log return, while the green band represents the ±2 standard deviations (approximate 95% confidence interval) across multiple stochastic forward passes. The model effectively captures higher uncertainty during volatile periods (e.g., near time steps 50, 200, 300), while expressing greater confidence during quieter intervals. This ability to quantify predictive uncertainty is critical in financial applications, where understanding confidence in a forecast can be as important as the forecast itself.
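One way to sanity-check such a band is empirical coverage: the fraction of actual values that fall inside ±2 standard deviations. A self-contained sketch on synthetic stand-ins (the real check would use y_true_tensor, mean_preds, and std_preds from the cells above):

```python
import numpy as np

rng = np.random.default_rng(1)
true_vals = rng.normal(0.0, 1.0, 500)               # stand-in for the actual returns
pred_means = true_vals + rng.normal(0.0, 0.3, 500)  # imperfect predictions
pred_stds = np.full(500, 1.0)                       # predictive std (illustrative)

# Fraction of actual values inside the ±2 std band
inside = (true_vals >= pred_means - 2 * pred_stds) & (true_vals <= pred_means + 2 * pred_stds)
coverage = inside.mean()
```

A well-calibrated ±2 std band should cover roughly 95% of the actuals; far lower coverage signals overconfident uncertainty estimates.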

Plot VaR and ES

In [ ]:
plt.figure(figsize=(10, 5))
plt.plot(y_true_tensor, label="Actual", alpha=0.6)
plt.plot(mean_preds, label="Mean Prediction", alpha=0.8)
plt.plot(var_95, label=f"VaR ({int((1-alpha)*100)}%)", linestyle='--', color='red')
plt.plot(es_95, label=f"ES ({int((1-alpha)*100)}%)", linestyle='--', color='purple')
plt.fill_between(
    range(len(mean_preds)),
    mean_preds - 2 * std_preds,
    mean_preds + 2 * std_preds,
    color='green', alpha=0.2,
    label="±2 std"
)
plt.title("MC Dropout: Mean Prediction, VaR, and Expected Shortfall")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

V. Save the Experiment¶

In [ ]:
import os, json, pickle
import pandas as pd
import numpy as np
import torch

def save_experiment(
    model, config, train_losses, val_losses, y_true, y_pred,
    output_dir="experiment_transformer_final", model_filename="project_weights_transformer_final.pt"
):
    os.makedirs(output_dir, exist_ok=True)

    # Save model weights
    torch.save(model.state_dict(), os.path.join(output_dir, model_filename))

    # Save config
    with open(os.path.join(output_dir, "best_config.json"), "w") as f:
        json.dump(config, f, indent=4)

    # Save training history
    with open(os.path.join(output_dir, "training_history.pkl"), "wb") as f:
        pickle.dump({"train_losses": train_losses, "val_losses": val_losses}, f)

    # Save predictions
    df_preds = pd.DataFrame({
        "Actual": np.array(y_true),
        "Predicted": np.array(y_pred)
    })
    df_preds.to_csv(os.path.join(output_dir, "test_predictions.csv"), index=False)

    print(f"Final experiment saved to: {output_dir}")

Saving...

In [ ]:
save_experiment(
    model=model,
    config=final_config,
    train_losses=train_losses,
    val_losses=val_losses,
    y_true=y_true,
    y_pred=y_pred,
    output_dir="experiment_transformer_final",
    model_filename="project_weights_transformer_final.pt"
)
Final experiment saved to: experiment_transformer_final

VI. Plot¶

  1. TensorBoard Visualization
In [ ]:
%reload_ext tensorboard
%tensorboard --logdir=runs
In [ ]:
import pickle
import pandas as pd
import matplotlib.pyplot as plt

# Load training history
with open("experiment_transformer_final/training_history.pkl", "rb") as f:
    history = pickle.load(f)
train_losses = history["train_losses"]
val_losses = history["val_losses"]

# Load predictions
df_preds = pd.read_csv("experiment_transformer_final/test_predictions.csv")

2. Training vs Validation Loss Plot

In [ ]:
plt.figure(figsize=(8, 4))
plt.plot(train_losses, label="Train Loss")
plt.plot(val_losses, label="Val Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("Transformer (Final): Training & Validation Loss")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

The training loss decreases only gradually, while the validation loss remains mostly stable aside from a sharp spike at epoch 9. This outlier is likely due to a noisy batch or MC Dropout sampling variability, which is common with stochastic inference. The model recovers quickly, and early stopping at epoch 21 helps avoid overfitting. Overall, the training behavior is stable and well regularized, reflecting the robustness of the patch-based design.

3. Line Plot: Predicted vs Actual (Log Returns)

In [ ]:
plt.figure(figsize=(10, 4))
plt.plot(df_preds["Actual"], label="Actual", alpha=0.7)
plt.plot(df_preds["Predicted"], label="Predicted", alpha=0.7)
plt.title("Transformer (Final): Predicted vs Actual Log Returns")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.show()

The predicted values align closely with the general direction of the actual returns, especially during stable periods. While the model smooths out some extreme movements, a common behavior in MSE-optimized regressors, it effectively captures broader trends and turning points. This balance between fidelity and stability makes it well suited for integration with downstream risk metrics like VaR and Expected Shortfall.

4. Scatter Plot: Actual vs Predicted

In [ ]:
plt.figure(figsize=(6, 6))
plt.scatter(df_preds["Actual"], df_preds["Predicted"], alpha=0.5, color='darkorange')
plt.plot([-2, 2], [-2, 2], color='gray', linestyle='--')  # identity line
plt.title("Transformer (Final): Actual vs Predicted Scatter")
plt.xlabel("Actual Log Return")
plt.ylabel("Predicted Log Return")
plt.grid(True)
plt.axis("equal")
plt.tight_layout()
plt.show()

The predictions generally cluster around the origin, showing that the model captures the central tendency well. There’s reasonable alignment with the diagonal line, especially for moderate returns. As with other models, the extreme values are slightly underpredicted, a common effect in models trained with MSE loss. Overall, this plot confirms that the model is well-calibrated for typical return ranges and reasonably responsive to directional movement.

VII. Comparisons¶

1. Tuned LSTM vs GRU

1.1 Actual Vs Predictions

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

# Load predictions from saved CSVs
lstm_df = pd.read_csv("/content/experiment_lstm_tuned/test_predictions.csv")
gru_df = pd.read_csv("/content/experiment_gru_tuned/test_predictions.csv")

# Extract actual and predicted values
y_true = lstm_df["Actual"]  # Both models share the same test-set ground truth
y_lstm = lstm_df["Predicted"]
y_gru = gru_df["Predicted"]

# Plot
plt.figure(figsize=(12, 5))
plt.plot(y_true, label="Actual", linewidth=1)
plt.plot(y_lstm, label="LSTM Predicted", color="orange")
plt.plot(y_gru, label="GRU Predicted", color="green")

plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.title("Actual vs LSTM vs GRU Log Returns on Test Set")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

1.2 Training and Validation Losses

In [ ]:
import pickle
import matplotlib.pyplot as plt

# Load LSTM training history
with open("/content/experiment_lstm_tuned/training_history.pkl", "rb") as f:
    lstm_history = pickle.load(f)

# Load GRU training history
with open("/content/experiment_gru_tuned/training_history.pkl", "rb") as f:
    gru_history = pickle.load(f)

# Plot loss curves
fig, axs = plt.subplots(1, 2, figsize=(14, 5), sharey=True)

# LSTM Plot
axs[0].plot(lstm_history["train_losses"], label="Train Loss")
axs[0].plot(lstm_history["val_losses"], label="Val Loss", color="orange")
axs[0].set_title("Tuned LSTM Loss")
axs[0].set_xlabel("Epoch")
axs[0].set_ylabel("MSE Loss")
axs[0].legend()
axs[0].grid(True)

# GRU Plot
axs[1].plot(gru_history["train_losses"], label="Train Loss")
axs[1].plot(gru_history["val_losses"], label="Val Loss", color="orange")
axs[1].set_title("Tuned GRU Loss")
axs[1].set_xlabel("Epoch")
axs[1].legend()
axs[1].grid(True)

plt.suptitle("Training vs. Validation Loss Comparison: Tuned LSTM vs GRU")
plt.tight_layout()
plt.savefig("lstm_gru_loss_comparison.png")
plt.show()

2. Transformer Models

2.1 Actual Vs Predictions

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

# Load prediction files
df_vanilla = pd.read_csv("/content/experiment_transformer_vanilla/test_predictions.csv")
df_regularized = pd.read_csv("/content/experiment_transformer_regularized/test_predictions.csv")
df_final = pd.read_csv("/content/experiment_transformer_final/test_predictions.csv")


# Plot
plt.figure(figsize=(14, 5))
plt.plot(df_vanilla["Actual"], label="Actual", color='steelblue', linewidth=1)
plt.plot(df_vanilla["Predicted"], label="Vanilla Transformer", color='orange', linewidth=1)
plt.plot(df_regularized["Predicted"], label="Regularized Transformer", color='purple', linewidth=1)
plt.plot(df_final["Predicted"], label="Patch-based Transformer", color='green', linewidth=1)

plt.title("Actual vs Transformer Predictions (All Variants)")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

2.2 Training and Validation Losses

In [ ]:
import os
import pickle
import matplotlib.pyplot as plt

# Define paths to pkl files
paths = {
    "Vanilla Transformer": "/content/experiment_transformer_vanilla/training_history.pkl",
    "Regularized Transformer": "/content/experiment_transformer_regularized/training_history.pkl",
    "Patch-based Transformer": "/content/experiment_transformer_final/training_history.pkl"
}

# Define distinct colors
colors = {
    "Vanilla Transformer": "orange",
    "Regularized Transformer": "purple",
    "Patch-based Transformer": "green"
}

# Initialize dictionary to store losses
losses = {}

# Load data from each model
for model_name, path in paths.items():
    with open(path, "rb") as f:
        history = pickle.load(f)
        losses[model_name] = {
            "train": history["train_losses"],
            "val": history["val_losses"]
        }

# Plot side-by-side comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 5), sharey=True)

# Plot training losses
for model_name in losses:
    ax1.plot(losses[model_name]["train"], label=model_name, color=colors[model_name])
ax1.set_title("Training Loss Comparison (MSE)")
ax1.set_xlabel("Epoch")
ax1.set_ylabel("Loss")
ax1.legend()
ax1.grid(True)

# Plot validation losses
for model_name in losses:
    ax2.plot(losses[model_name]["val"], label=model_name, color=colors[model_name])
ax2.set_title("Validation Loss Comparison (MSE)")
ax2.set_xlabel("Epoch")
ax2.legend()
ax2.grid(True)

plt.suptitle("Training vs. Validation Loss for Transformer Variants")
plt.tight_layout()
plt.show()

Model Architecture Summary¶

This project evaluates multiple deep learning architectures for financial time series forecasting. Below is a summary of the models implemented and compared:

| Model Name | Type | Key Features |
| --- | --- | --- |
| LSTMModel | Recurrent (LSTM) | Captures long-term dependencies using memory cells and gates. |
| GRUModel | Recurrent (GRU) | A simplified version of LSTM with fewer parameters and similar performance. |
| TimeSeriesTransformer | Transformer | Vanilla Transformer with positional encoding and self-attention. |
| TransformerRegularized | Transformer | Adds LayerNorm, dropout regularization, and optional MC Dropout inference. |
| ForecastingTransformer | Patch-based Transformer | Inspired by PatchTST; uses patch embedding, positional encoding, and global pooling. |
  • All models use a sliding window of past 30, 40, or 60 days of technical indicators, depending on the configuration.

    • Baseline models use seq_len=30.
    • Tuned LSTM and GRU models, as well as the Vanilla and Regularized Transformers, use seq_len=60, as this yielded better results during grid search for both LSTM & GRU models.
    • The Final Patch-based Transformer uses seq_len=40, derived from patch_len=10 × 4, following the patching design inspired by Vision Transformers (ViTs).
  • Monte Carlo Dropout is applied to TransformerRegularized and ForecastingTransformer for uncertainty estimation.

  • VaR and ES are computed from the predictive distributions to assess financial risk.
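The sliding-window setup described above can be sketched as follows. make_windows is a hypothetical helper for illustration, not the notebook's prepare_data:

```python
import numpy as np

def make_windows(series, seq_len):
    """Slide a fixed-length window over a 1-D series; the target is the next value."""
    X = np.stack([series[i:i + seq_len] for i in range(len(series) - seq_len)])
    y = series[seq_len:]
    return X, y

s = np.arange(100, dtype=float)     # toy stand-in for an indicator series
X, y = make_windows(s, seq_len=30)  # X: (70, 30) windows, y: (70,) next-step targets
```

In the project, each window holds multiple technical-indicator columns rather than a single series, but the indexing logic is the same.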

References¶

  1. Vaswani et al. (2017) — Attention Is All You Need. Introduced the Transformer architecture used in all advanced models here. https://arxiv.org/abs/1706.03762

  2. Hochreiter & Schmidhuber (1997) — Long Short-Term Memory. Foundation for the LSTM baseline. https://www.bioinf.jku.at/publications/older/2604.pdf

  3. Cho et al. (2014) — On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. Introduces the GRU used as a baseline. https://arxiv.org/abs/1409.1259

  4. Gal & Ghahramani (2016) — Dropout as a Bayesian Approximation. The Monte Carlo Dropout inference follows this work. https://arxiv.org/abs/1506.02142

  5. Rockafellar & Uryasev (2000) — Optimization of Conditional Value-at-Risk. The VaR and ES computations are grounded in this risk framework. https://doi.org/10.21314/JOR.2000.038

  6. Nie et al. (2023) — A Time Series is Worth 64 Words: Long-Term Forecasting with Transformers (PatchTST). The patch-based Transformer is conceptually inspired by this work. https://arxiv.org/abs/2211.14730

  7. Libraries and APIs

  • Yahoo Finance API (via yfinance) – Used to fetch S&P 500 OHLCV stock data.
  • PyTorch Documentation – TransformerEncoder – Used to build Transformer encoder layers.
  • PyTorch Documentation – Dropout – Used in MC Dropout at inference.
  • PyTorch Documentation – LayerNorm – Used for normalization in regularized and patch-based models.
  • scikit-learn – Used for data normalization (StandardScaler) and metrics like MAE, MSE, RMSE.
  • matplotlib – Used for visualizing training curves, prediction results, and uncertainty histograms.
  • TensorBoard (PyTorch Integration) – Used to log training/validation loss for analysis.